Preface


Machine learning is a fascinating area to work in: from detecting anomalous events
in live streams of sensor data to identifying emergent topics in text collections,
exciting problems are never too far away.

Quantum information theory also teems with excitement. By manipulating particles
at a subatomic level, we are able to perform a Fourier transform exponentially
faster, or search in a database quadratically faster than the classical limit. Superdense
coding transmits two classical bits using just one qubit. Quantum encryption is
unbreakable, at least in theory.

The fundamental question of this monograph is simple: What can quantum 
computing contribute to machine learning? We naturally expect a speedup from 
quantum methods, but what kind of speedup? Quadratic? Or is exponential speedup 
possible? It is natural to treat any form of reduced computational complexity with 
suspicion. Are there tradeoffs in reducing the complexity? 

Execution time is just one concern of learning algorithms. Can we achieve higher 
generalization performance by turning to quantum computing? After all, training 
error is not that difficult to keep in check with classical algorithms either: the 
real problem is finding algorithms that also perform well on previously unseen 
instances. Adiabatic quantum optimization is capable of finding the global optimum 
of nonconvex objective functions. Grover’s algorithm finds the global minimum in a 
discrete search space. Quantum process tomography relies on a double optimization 
process that resembles active learning and transduction. How do we rephrase learning 
problems to fit these paradigms? 

Storage capacity is also of interest. Quantum associative memories, the quantum 
variants of Hopfield networks, store exponentially more patterns than their classical 
counterparts. How do we exploit such capacity efficiently? 

These and similar questions motivated the writing of this book. The literature on the 
subject is expanding, but the target audience of the articles is seldom the academics 
working on machine learning, not to mention practitioners. Coming from the other 
direction, quantum information scientists who work in this area do not necessarily 
aim at a deep understanding of learning theory when devising new algorithms. 

This book addresses both of these communities: theorists of quantum computing 
and quantum information processing who wish to keep up to date with the wider 
context of their work, and researchers in machine learning who wish to benefit from 
cutting-edge insights into quantum computing. 

Notations


indicator function 
set of complex numbers 
number of dimensions in the feature space 
error 
expectation value 
group 
Hamiltonian 
Hilbert space 
identity matrix or identity operator 
number of weak classifiers or clusters, nodes in a neural net 
number of training instances 
measurement: projective or POVM 
probability measure 
set of real numbers 
density matrix 
σx, σy, σz Pauli matrices
trace of a matrix 
unitary time evolution operator 
weight vector 
x, x_i data instance
X matrix of data instances
y, y_i label
T transpose
† Hermitian conjugate
‖·‖ norm of a vector
[·,·] commutator of two operators
⊗ tensor product
⊕ XOR operation or direct sum of subspaces

Fundamental Concepts


Introduction


The quest of machine learning is ambitious: the discipline seeks to understand 
what learning is, and studies how algorithms approximate learning. Quantum machine 
learning takes these ambitions a step further: quantum computing enlists the help of
nature at a subatomic level to aid the learning process. 

Machine learning is based on minimizing a constrained multivariate function, and
the algorithms that perform this optimization are at the core of data mining and data
visualization techniques. The result of the optimization is a decision function that maps
input points to output points.
While this view on machine learning is simplistic, and exceptions are countless, some 
form of optimization is always central to learning theory. 

The idea of using quantum mechanics for computations stems from simulating 
such systems. Feynman (1982) noted that simulating quantum systems on classical 
computers becomes infeasible as soon as the system size increases, whereas quantum
particles would not suffer from similar constraints. Deutsch (1985) generalized the 
idea. He noted that quantum computers are universal Turing machines, and that 
quantum parallelism implies that certain probabilistic tasks can be performed faster 
than by any classical means. 

Today, quantum information has three main specializations: quantum computing, 
quantum information theory, and quantum cryptography (Fuchs, 2002, p. 49). We 
are not concerned with quantum cryptography, which primarily deals with secure 
exchange of information. Quantum information theory studies the storage and 
transmission of information encoded in quantum states; we rely on some concepts 
such as quantum channels and quantum process tomography. Our primary focus, 
however, is quantum computing, the field of inquiry that uses quantum phenomena 
such as superposition, entanglement, and interference to operate on data represented 
by quantum states. 

Algorithms of importance emerged a decade after the first proposals of quantum 
computing appeared. Shor (1997) introduced a method to factorize integers exponentially
faster, and Grover (1996) presented an algorithm to find an element in
an unordered data set quadratically faster than the classical limit. One would have 
expected a slew of new quantum algorithms after these pioneering articles, but the 
task proved hard (Bacon and van Dam, 2010). Part of the reason is that now we expect 
that a quantum algorithm should be faster—we see no value in a quantum algorithm 
with the same computational complexity as a known classical one. Furthermore, even
with the spectacular speedups, the class NP cannot be solved on a quantum computer
in subexponential time (Bennett et al., 1997).

Quantum Machine Learning. http://dx.doi.org/10.1016/B978-0-12-800953-6.00001-3
© 2014 Elsevier Inc. All rights reserved.

While universal quantum computers remain out of reach, small-scale experiments
implementing a few qubits are operational. In addition, quantum computers restricted
to specific problem domains are becoming feasible. For instance, experimental validation
of combinatorial optimization on over 500 binary variables on an adiabatic quantum
computer showed considerable speedup over optimized classical implementations
(McGeoch and Wang, 2013). The result is controversial, however (Rønnow
et al., 2014).

Recent advances in quantum information theory indicate that machine learning 
may benefit from various paradigms of the field. For instance, adiabatic quantum 
computing finds the minimum of a multivariate function by a controlled physical 
process using the adiabatic theorem (Farhi et al., 2000). The function is translated to
a physical description, the Hamiltonian operator of a quantum system. Then, a system
with a simple Hamiltonian is prepared and initialized to the ground state, the lowest
energy state a quantum system can occupy. Finally, the simple Hamiltonian is evolved
slowly to the target Hamiltonian, and, by the adiabatic theorem, the system remains in
the ground state provided the evolution is slow enough. At the end of the process, the
solution is read out from the system, and we obtain the global optimum for the function
in question.
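
The interpolation between the two Hamiltonians is commonly written as a single time-dependent operator. The following is a sketch in standard notation, not a formula quoted from this chapter; the symbols $H_0$ (the simple initial Hamiltonian), $H_1$ (the target Hamiltonian encoding the function), $T$ (the total evolution time), and $g_{\min}$ (the minimum spectral gap) are introduced here for illustration only.

```latex
H(t) = \left(1 - \frac{t}{T}\right) H_0 + \frac{t}{T}\, H_1, \qquad 0 \le t \le T,
\qquad
g_{\min} = \min_{0 \le t \le T} \bigl( E_1(t) - E_0(t) \bigr)
```

Here $E_0(t)$ and $E_1(t)$ denote the two lowest eigenvalues of $H(t)$. Roughly speaking, the smaller the gap $g_{\min}$, the larger $T$ must be for the system to stay in the ground state.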

While more and more articles that explore the intersection of quantum computing 
and machine learning are being published, the field is fragmented, as was already 
noted over a decade ago (Bonner and Freivalds, 2002). This should not come as a 
surprise: machine learning itself is a diverse and fragmented field of inquiry. We 
attempt to identify common algorithms and trends, and observe the subtle interplay 
between faster execution and improved performance in machine learning by quantum 
computing. 

As an example of this interplay, consider convexity: it is often considered a
virtue in machine learning. Algorithms for convex optimization problems do not get
stuck in local extrema, they reach a global optimum, and they are not sensitive to
initial conditions. Furthermore, convex methods have easy-to-understand analytical
characteristics, and theoretical bounds on convergence and other properties are easier
to derive. Nonconvex optimization, on the other hand, is a forte of quantum methods.
Algorithms on classical hardware use gradient descent or similar iterative methods to
arrive at the global optimum. Quantum algorithms approach the optimum through an
entirely different, more physical process, and they are not bound by convexity restrictions.
Nonconvexity, in turn, has great advantages for learning: sparser models ensure better 
generalization performance, and nonconvex objective functions are less sensitive to 
noise and outliers. For this reason, numerous approaches and heuristics exist for 
nonconvex optimization on classical hardware, which might prove easier and faster 
to solve by quantum computing. 

As in the case of computational complexity, we can establish limits on the 
performance of quantum learning compared with the classical flavor. Quantum 
learning is not more powerful than classical learning, at least from an information-theoretic
perspective, up to polynomial factors (Servedio and Gortler, 2004). On
the other hand, there are apparent computational advantages: certain concept classes
are polynomial-time exact-learnable from quantum membership queries, but they
are not polynomial-time learnable from classical membership queries (Servedio and
Gortler, 2004). In some settings, quantum machine learning can take logarithmic time in
both the number of vectors and their dimension. This is an exponential speedup over
classical algorithms, but at the price of having both quantum input and quantum output
(Lloyd et al., 2013a).


1.1 Learning Theory and Data Mining 


Machine learning revolves around algorithms, model complexity, and computational 
complexity. Data mining is a field related to machine learning, but its focus is 
different. The goal is similar: identify patterns in large data sets, but aside from 
the raw analysis, it encompasses a broader spectrum of data processing steps. Thus, 
data mining borrows methods from statistics, and algorithms from machine learning, 
information retrieval, visualization, and distributed computing, but it also relies on 
concepts familiar from databases and data management. In some contexts, data mining 
includes any form of large-scale information processing. 

In this way, data mining is more applied than machine learning. It is closer to what 
practitioners would find useful. Data may come from any number of sources: business, 
science, engineering, sensor networks, medical applications, spatial information, and 
surveillance, to mention just a few. Making sense of the data deluge is the primary 
target of data mining. 

Data mining is a natural step in the evolution of information systems. Early 
database systems allowed the storing and querying of data, but analytic functionality 
was limited. As databases grew, a need for automatic analysis emerged. At the same 
time, the amount of unstructured information—text, images, video, music—exploded. 
Data mining is meant to fill the role of analyzing and understanding both structured 
and unstructured data collections, whether they are in databases or stored in some 
other form. 

Machine learning often takes a restricted view on data: algorithms assume either a 
geometric perspective, treating data instances as vectors, or a probabilistic one, where 
data instances are multivariate random variables. Data mining involves preprocessing 
steps that extract these views from data. 

For instance, in text mining—data mining aimed at unstructured text documents— 
the initial step builds a vector space from documents. This step starts with identification
of a set of keywords, that is, words that carry meaning: mainly nouns, verbs,
and adjectives. Pronouns, articles, and other connectives are disregarded. Words that 
occur too frequently are also discarded: these differentiate only a little between two 
text documents. Then, assigning an arbitrary vector from the canonical basis to each 
keyword, an indexer constructs document vectors by summing these basis vectors. The 
summation includes a weighting, where the weighting reflects the relative importance 
of the keyword in that particular document. Weighting often incorporates the global 
importance of the keyword across all documents. 
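
The indexing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the indexer of any particular system: the stop-word list, the function name `build_document_vectors`, and the choice of term count times inverse document frequency as the weighting are all hypothetical choices made for the example.

```python
import math
from collections import Counter

# Hypothetical stop-word list standing in for pronouns, articles, and connectives.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def build_document_vectors(documents):
    """Return (vocabulary, vectors) with one weighted vector per document."""
    tokenized = [
        [word for word in doc.lower().split() if word not in STOP_WORDS]
        for doc in documents
    ]
    # Each keyword gets one coordinate, that is, one canonical basis vector.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    position = {word: i for i, word in enumerate(vocabulary)}
    # Document frequency: the number of documents in which a keyword occurs.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        vector = [0.0] * len(vocabulary)
        for word, count in Counter(tokens).items():
            # Local weight (term count) times global weight (inverse document frequency).
            idf = math.log(len(documents) / doc_freq[word])
            vector[position[word]] = count * idf
        vectors.append(vector)
    return vocabulary, vectors


documents = [
    "quantum computing and learning",
    "the theory of machine learning",
    "quantum learning theory",
]
vocabulary, vectors = build_document_vectors(documents)
```

With this weighting, a keyword that occurs in every document (here, "learning") receives zero weight, which mirrors the remark that overly frequent words hardly differentiate between documents.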

The resulting vector space, the term-document space, is readily analyzed by
a whole range of machine learning algorithms. For instance, K-means clustering 
identifies groups of similar documents, support vector machines learn to classify 
documents to predefined categories, and dimensionality reduction techniques, such 
as singular value decomposition, improve retrieval performance. 

The data mining process often includes how the extracted information is presented 
to the user. Visualization and human-computer interfaces become important at this 
stage. Continuing the text mining example, we can map groups of similar documents 
on a two-dimensional plane with self-organizing maps, giving a visual overview of 
the clustering structure to the user. 

Machine learning is crucial to data mining. Learning algorithms are at the heart 
of advanced data analytics, but there is much more to successful data mining. While 
quantum methods might be relevant at other stages of the data mining process, we 
restrict our attention to core machine learning techniques and their relation to quantum 
computing. 


1.2 Why Quantum Computers? 


We all know about the spectacular theoretical results in quantum computing: factoring 
of integers is exponentially faster and unordered search is quadratically faster than 
with any known classical algorithm. Yet, apart from the known examples, finding an 
application for quantum computing is not easy. 

Designing a good quantum algorithm is a challenging task. This does not necessarily
derive from the difficulty of quantum mechanics. Rather, the problem lies in our
expectations: a quantum algorithm must be faster and computationally less complex 
than any known classical algorithm for the same purpose. 

The most recent advances in quantum computing show that machine learning might
just be the right field of application. As machine learning usually boils down to a form
of multivariate optimization, it translates directly to quantum annealing and adiabatic
quantum computing. This form of learning has already demonstrated results on
actual quantum hardware, although many obstacles remain before the method scales
further.

We should, however, not confine ourselves to adiabatic quantum computers. In 
fact, we hardly need general-purpose quantum computers: the task of learning is far 
more restricted. Hence, other paradigms in quantum information theory and quantum 
mechanics are promising for learning. Quantum process tomography is able to 
learn an unknown function within well-defined symmetry and physical constraints— 
this is useful for regression analysis. Quantum neural networks based on arbitrary 
implementation of qubits offer a useful level of abstraction. Furthermore, there is 
great freedom in implementing such networks: optical systems, nuclear magnetic 
resonance, and quantum dots have been suggested. Quantum hardware dedicated to 
machine learning may become reality much faster than a general-purpose quantum 
computer. 


1.3 A Heterogeneous Model


It is unlikely that quantum computers will replace classical computers. Why would 
they? Classical computers work flawlessly at countless tasks, from word processing 
to controlling complex systems. Quantum computers, on the other hand, are good at 
certain computational workloads where their classical counterparts are less efficient. 

Let us consider the state of the art in high-performance computing. Accelerators 
have become commonplace, complementing traditional central processing units. 
These accelerators are good at single-instruction, multiple-data-type parallelism, 
which is typical in computational linear algebra. Most of these accelerators derive 
from graphics processing units, which were originally designed to generate three-dimensional
images at a high frame rate on a screen; hence, accuracy was not
a consideration. With recognition of their potential in scientific computing, the
platform evolved to produce high-accuracy double-precision floating point operations.
Yet, owing to their design philosophy, they cannot accelerate just any workload.
Random data access patterns, for instance, destroy the performance. Inherently
single-threaded applications will not show competitive speed on such hardware either.
In contemporary high-performance computing, we must design algorithms using 
heterogeneous hardware: some parts execute faster on central processing units, others 
on accelerators. This model has been so successful that almost all supercomputers 
being built today include some kind of accelerator. 


If quantum computers become feasible, a similar model is likely to follow for at 
least two reasons: 


1. The control systems of the quantum hardware will be classical computers. 
2. Data ingestion and measurement readout will rely on classical hardware. 


More extensive collaboration between the quantum and classical realms is also 
expected. Quantum neural networks already hint at a recursive embedding of classical 
and quantum computing (Section 11.3). This model is the closest to the prevailing
standards of high-performance computing: we already design algorithms with accelerators
in mind.


1.4 An Overview of Quantum Machine Learning Algorithms


Dozens of articles have been published on quantum machine learning, and we observe 
some general characteristics that describe the various approaches. We summarize our 
observations in Table 1.1, and detail the main traits below. 

Many quantum learning algorithms rely on the application of Grover’s search 
or one of its variants (Section 4.5). This includes mostly unsupervised methods: 
K-medians, hierarchical clustering, or quantum manifold embedding (Chapter 10). 
In addition, quantum associative memory and quantum neural networks often rely on 
this search (Chapter 11). An early version of quantum support vector machines also
uses Grover’s search (Section 12.2). In total, about half of all the methods proposed
for learning in a quantum setting use this algorithm.


Table 1.1 The Characteristics of the Main Approaches to Quantum Machine Learning

Algorithm                Reference                     Grover    Speedup      Quantum data  Generalization performance  Implementation
K-medians                Aïmeur et al. (2013)          Yes       Quadratic
Hierarchical clustering  Aïmeur et al. (2013)          Yes       Quadratic
K-means                  Lloyd et al. (2013a)          Optional  Exponential
Principal components     Lloyd et al. (2013b)          No        Exponential
Associative memory       Ventura and Martinez (2000)   Yes
                         Trugenberger (2001)           No
Neural networks          Narayanan and Menneer (2000)  Yes                                  Numerical
Support vector machines  Anguita et al. (2003)         Yes       Quadratic                  Analytical
                         Rebentrost et al. (2013)      No        Exponential                No
Nearest neighbors        Wiebe et al. (2014)                     Quadratic                  Numerical
Regression               Bisio et al. (2010)           No
Boosting                 Neven et al. (2009)                     Quadratic                  Analytical

The column headed “Algorithm” lists the classical learning method. The column headed
“Reference” lists the most important articles related to the quantum variant. The column
headed “Grover” indicates whether the algorithm uses Grover’s search or an extension
thereof. The column headed “Speedup” indicates how much faster the quantum variant is
compared with the best known classical version. “Quantum data” refers to whether the
input, output, or both are quantum states, as opposed to states prepared from classical
vectors. The column headed “Generalization performance” states whether this quality of
the learning algorithm was studied in the relevant articles. “Implementation” refers to
attempts to develop a physical realization.

Grover’s search has a quadratic speedup over the best possible classical algorithm 
on unordered data sets. This sets the limit to how much faster those learning methods 
that rely on it get. Exponential speedup is possible in scenarios where both the input 
and the output are also quantum: listing class membership or reading the classical data 
once would imply at least linear time complexity, which could only be a polynomial 
speedup. Examples include quantum principal component analysis (Section 10.3), 
quantum K-means (Section 10.5), and a different flavor of quantum support vector 
machines (Section 12.3). Regression based on quantum process tomography requires 
an optimal input state, and, in this regard, it needs a quantum input (Chapter 13). At a
high level, it is possible to define an abstract class of problems that can only be learned 
in polynomial time by quantum algorithms using quantum input (Section 2.5). 
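
To make the quadratic bound concrete, the following sketch simulates Grover's iteration classically on a vector of real amplitudes. It is an illustration only, assuming a single marked item: the function name `grover_search`, the data set size, and the marked index are hypothetical. A classical search over an unordered set of N items needs on the order of N oracle queries, whereas the simulated iteration count grows with the square root of N.

```python
import math


def grover_search(n_items, marked):
    """Classically simulate Grover's iteration for one marked item.

    Returns the number of iterations used and the probability of
    measuring the marked index afterwards.
    """
    # Start from the uniform superposition over all indices.
    amplitudes = [1.0 / math.sqrt(n_items)] * n_items
    # About (pi / 4) * sqrt(N) iterations bring the marked amplitude close to 1.
    iterations = math.floor(math.pi / 4.0 * math.sqrt(n_items))
    for _ in range(iterations):
        # Oracle: flip the sign of the marked amplitude.
        amplitudes[marked] = -amplitudes[marked]
        # Diffusion: reflect every amplitude about the mean amplitude.
        mean = sum(amplitudes) / n_items
        amplitudes = [2.0 * mean - a for a in amplitudes]
    return iterations, amplitudes[marked] ** 2


iterations, probability = grover_search(1024, marked=123)
```

For 1024 items the loop runs 25 times, compared with roughly 512 queries expected for a classical linear scan. The simulation itself still takes classical time linear in N per iteration, so it demonstrates the query count, not an actual speedup.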

Surprisingly, few authors have been interested in the generalization
performance of quantum learning algorithms. Analytical investigations are
especially sparse, with quantum boosting by adiabatic quantum computing being
a notable exception (Chapter 14), along with a form of quantum support vector
machines (Section 12.2). Numerical comparisons favor quantum methods in the
case of quantum neural networks (Chapter 11) and quantum nearest neighbors
(Section 12.1).

While we are far from developing scalable universal quantum computers, learning
methods require far more specialized hardware, which is more attainable with current
technology. A controversial example is adiabatic quantum optimization in learning
problems (Section 14.7); more gradual and better founded are the small-scale
implementations of quantum perceptrons and neural networks (Section 11.4).


1.5 Quantum-Like Learning on Classical Computers 


Machine learning has a lot to adopt from quantum mechanics, and this statement is 
not restricted to actual quantum computing implementations of learning algorithms. 
Applying principles from quantum mechanics to design algorithms for classical 
computers is also a successful field of inquiry. We refer to these methods as quantum-like
learning. Superposition, sensitivity to contexts, entanglement, and the linearity of
evolution prove to be useful metaphors in many scenarios. These methods are outside 
our scope, but we highlight some developments in this section. For a more detailed 
overview, we refer the reader to Manju and Nigam (2012). 

Computational intelligence is a field related to machine learning that solves 
optimization problems by nature-inspired computational methods. These include 
swarm intelligence (Kennedy and Eberhart, 1995), force-driven methods (Chatterjee 
et al., 2008), evolutionary computing (Goldberg, 1989), and neural networks 
(Rumelhart et al., 1994). A new research direction which borrows metaphors from 
quantum physics emerged over the past decade. These quantum-like methods 
in machine learning are in a way inspired by nature; hence, they are related to 
computational intelligence. 

Quantum-like methods have found useful applications in areas where the system
displays contextual behavior. In such cases, a quantum approach naturally
incorporates this behavior (Khrennikov, 2010; Kitto, 2008). Apart from contextuality,
entanglement is successfully exploited where traditional models of correlation
fail (Bruza and Cole, 2005), and quantum superposition accounts for unusual results
of combining attributes of data instances (Aerts and Czachor, 2004).

Quantum-like learning methods do not represent a coherent whole; the algorithms 
are liberal in borrowing ideas from quantum physics and ignoring others, and hence 
there is seldom a connection between two quantum-like learning algorithms. 

Coming from evolutionary computing, there is a quantum version of particle swarm
optimization (Sun et al., 2004). The particles in a swarm are agents with simple
patterns of movements and actions, each associated with a potential solution.
Relying on only local information, the quantum variant is able to find the global
optimum for the optimization problem in question.

Dynamic quantum clustering emerged as a direct physical metaphor of evolving 
quantum particles (Weinstein and Horn, 2009). This approach approximates the 
potential energy of the Hamiltonian, and evolves the system iteratively to identify 
the clusters. The great advantage of this method is that the steps can be computed 
with simple linear algebra operations. The resulting evolving cluster structure is 
similar to that obtained with a flocking-based approach, which was inspired by 
biological systems (Cui et al., 2006), and it is similar to that resulting from Newtonian 
clustering with its pairwise forces (Blekas and Lagaris, 2007). Quantum-clustering-based
support vector regression extends the method further (Yu et al., 2010).

Quantum neural networks exploit the superposition of quantum states to accommodate
gradual membership of data instances (Purushothaman and Karayiannis, 1997).
Simulated quantum annealing avoids getting trapped in local minima by using the
metaphor of quantum tunneling (Sato et al., 2009).

The works cited above highlight how the machine learning community may benefit 
from quantum metaphors, potentially gaining higher accuracy and effectiveness. We 
believe there is much more to gain. An attractive aspect of quantum theory is the 
inherent structure which unites geometry and probability theory in one framework. 
Reasoning and learning in a quantum-like method are described by linear algebra 
operations. This, in turn, translates to computational advantages: software libraries
of linear algebra routines are typically among the first to be optimized for emergent hardware.
Contemporary high-performance computing clusters are often equipped with graphics 
processing units, which are known to accelerate many computations, including linear 
algebra routines, often by several orders of magnitude. As pointed out by Asanovic 
et al. (2006), the overarching goal of the future of high-performance computing 
should be to make it easy to write programs that execute efficiently on highly 
parallel computing systems. The metaphors offered by quantum-like methods bring 
exactly this ease of programming supercomputers to machine learning. Early results 
show that quantum-like methods can, indeed, be accelerated by several orders of 
magnitude (Wittek, 2013). 


Machine Learning 


Machine learning is a field of artificial intelligence that seeks patterns in empirical 
data without forcing models on the data—that is, the approach is data-driven, rather 
than model-driven (Section 2.1). A typical example is clustering: given a distance 
function between data instances, the task is to group similar items together using an 
iterative algorithm. Another example is fitting a multidimensional function on a set of 
data points to estimate the generating distribution. 
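
One well-known instance of such an iterative clustering algorithm is K-means. The sketch below is a minimal version under stated assumptions: squared Euclidean distance serves as the distance function, the first k points are used as initial centroids (a naive, hypothetical choice), and the names `kmeans` and `squared_distance` as well as the toy points are introduced only for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Group points into k clusters by alternating assignment and update steps."""
    # Assumption: initialize the centroids with the first k points.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(coord) / len(members) for coord in zip(*members)]
    labels = [
        min(range(k), key=lambda c: squared_distance(p, centroids[c])) for p in points
    ]
    return centroids, labels


points = [(0.0, 0.0), (5.0, 5.0), (0.0, 1.0), (5.0, 6.0), (1.0, 0.0), (6.0, 5.0)]
centroids, labels = kmeans(points, k=2)
```

On this toy input the points near the origin end up in one cluster and the points near (5, 5) in the other. The result of K-means generally depends on the initialization, so practical implementations use more careful seeding than the one assumed here.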

Rather than a well-defined field, machine learning refers to a broad range of 
algorithms. A feature space, a mathematical representation of the data instances under 
study, is at the heart of learning algorithms. Learning patterns in the feature space 
may proceed on the basis of statistical models or other methods known as algorithmic 
learning theory (Section 2.2). 

Statistical modeling makes propositions about populations, using data drawn 
from the population of interest, relying on a form of random sampling. Any form 
of statistical modeling requires some assumptions: a statistical model is a set of 
assumptions concerning the generation of the observed data and similar data (Cox, 
2006). 

This contrasts with methods from algorithmic learning theory, which are not 
statistical or probabilistic in nature. The advantage of algorithmic learning theory is 
that it does not make use of statistical assumptions. Hence, we have more freedom 
in analyzing complex real-life data sets, where samples are dependent, where there is 
excess noise, and where the distribution is entirely unknown or skewed. 

Irrespective of the approach taken, machine learning algorithms fall into two major 
categories (Section 2.3): 


1. Supervised learning: the learning algorithm uses samples that are labeled. For example, the 
samples are microarray data from cells, and the labels indicate whether the sample cells are 
cancerous or healthy. The algorithm takes these labeled samples and uses them to induce 
a classifier. This classifier is a function that assigns labels to samples, including those that 
have never previously been seen by the algorithm. 

2. Unsupervised learning: in this scenario, the task is to find structure in the samples. For 
instance, finding clusters of similar instances in a growing collection of text documents 
reveals topical changes across time, highlighting trends of discussions, and indicating 
themes that are dropping out of fashion. 
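
As a small illustration of the supervised case, the sketch below induces a classifier from labeled samples using the nearest-neighbor rule. It is one simple choice among many, not a method prescribed by the text; the function names, the two-dimensional feature values, and the labels are made up for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def induce_classifier(samples, labels):
    """Return a function that labels a sample by its nearest training sample."""
    training = list(zip(samples, labels))

    def classify(sample):
        _, nearest_label = min(
            training, key=lambda pair: squared_distance(sample, pair[0])
        )
        return nearest_label

    return classify


# Hypothetical two-feature measurements with made-up labels.
samples = [(0.1, 0.2), (0.3, 0.1), (0.9, 0.8), (0.8, 1.0)]
labels = ["healthy", "healthy", "cancerous", "cancerous"]
classify = induce_classifier(samples, labels)
prediction = classify((0.7, 0.9))  # a sample not seen during training
```

The returned function plays the role of the classifier described in the first item: it assigns a label to any sample, including ones that were not part of the training set.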


Learning algorithms, supervised or unsupervised, statistical or not statistical, are 
expected to generalize well. Generalization means that the learned structure will apply 
beyond the training set: new, unseen instances will get the correct label in supervised
learning, or they will be matched to their most likely group in unsupervised learning. 
Generalization usually manifests itself in the form of a penalty for complexity, such as 
restrictions for smoothness or bounds on the vector space norm. Less complex models 
are less likely to overfit the data (Sections 2.4 and 2.5). 
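
A penalty on the norm of the weights can be illustrated with a one-parameter least-squares fit. The sketch assumes a squared-error loss plus a squared-norm penalty, which is one common choice and not the only one; the function name `fit_slope` and the data are hypothetical.

```python
def fit_slope(xs, ys, penalty):
    """Minimize sum((y - w * x) ** 2) + penalty * w ** 2 over a single weight w.

    Setting the derivative with respect to w to zero gives the closed form
    w = sum(x * y) / (sum(x * x) + penalty).
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
unpenalized = fit_slope(xs, ys, penalty=0.0)
penalized = fit_slope(xs, ys, penalty=14.0)
```

Without the penalty the fit reproduces the training data exactly; with it, the weight shrinks toward zero, trading training error for a simpler model.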

There is, however, no free lunch: without a priori knowledge, finding a learning 
model in reasonable computational time that applies to all problems equally well 
is unlikely. For this reason, the combination of several learners is commonplace 
(Section 2.6), and it is worth considering the computational complexity in learning 
theory (Section 2.7). 

While there are countless other important issues in machine learning, we restrict 
our attention to the ones outlined in this chapter, as we deem them to be most relevant 
to quantum learning models. 


2.1 Data-Driven Models 


Machine learning is an interdisciplinary field: it draws on traditional artificial intelligence
and statistics. Yet, it is distinct from both of them.

Statistics and statistical inference put data at the center of analysis to draw 
conclusions. Parametric models of statistical inference have strong assumptions. For 
instance, the distribution of the process that generates the observed values is assumed 
to be a multivariate normal distribution with only a finite number of unknown 
parameters. Nonparametric models do not have such an assumption. Since incorrect
assumptions invalidate statistical inference (Kruskal, 1988), nonparametric methods
are often preferred when little is known about the generating process. This approach is
closer to machine learning: fewer assumptions make a learning algorithm more general
and more applicable to multiple types of data.

Deduction and reasoning are at the heart of artificial intelligence, especially in 
the case of symbolic approaches. Knowledge representation and logic are key tools. 
Traditional artificial intelligence is thus heavily dependent on the model. Dealing with 
uncertainty calls for statistical methods, but the rigid models stay. Machine learning, 
on the other hand, allows patterns to emerge from the data, whereas models are 
secondary. 


2.2 Feature Space 


We want a learning algorithm to reveal insights into the phenomena being observed. 
A feature is a measurable heuristic property of the phenomena. In the statistical 
literature, features are usually called independent variables, and sometimes they are 
referred to as explanatory variables or predictors. Learning algorithms work with 
features—a careful selection of features will lead to a better model. 

Features are typically numeric. Qualitative features (for instance, string values 
such as small, medium, or large) are mapped to numeric values. Some discrete 
structures, such as graphs (Kondor and Lafferty, 2002) or strings (Lodhi et al., 2002), 
have nonnumeric features. 

Good features are discriminating: they aid the learner in identifying patterns and 
distinguishing between data instances. Most algorithms also assume independent 
features with no correlation between them. In some cases, dependency between 
features is beneficial, especially if only a few features are nonzero for each data 
instance—that is, the features are sparse (Wittek and Tan, 2011). 

The multidisciplinary nature of machine learning is reflected in how features are 
viewed. We may take a geometric view, treating features as tuples, vectors in a high- 
dimensional space—the feature space. Alternatively, we may view features from a 
probabilistic perspective, treating them as multivariate random variables. 

In the geometric view, features are grouped into a feature vector. Let $d$ denote the 
number of features. One vector of the canonical basis $\{e_1, e_2, \ldots, e_d\}$ of 
$\mathbb{R}^d$ is assigned to each feature. Let $x_{ij}$ be the weight of feature $i$ in 
data instance $j$. Thus, a feature vector $x_j$ for the object $j$ is a linear combination 
of the canonical basis vectors: 

$$x_j = \sum_{i=1}^{d} x_{ij} e_i. \tag{2.1}$$

By writing $x_j$ as a column vector, we have $x_j^T = (x_{1j}, x_{2j}, \ldots, x_{dj})$. 
For a collection of $N$ data instances, the $x_{ij}$ weights form a $d \times N$ matrix. 

Since the basis vectors of the canonical basis are perpendicular to one another, this 
implies the assumption that the features are mutually independent; this assumption is 
often violated. The assignment of features to vectors is arbitrary: a feature may be 
assigned to any of the vectors of the canonical basis. 

With use of the geometric view, distance functions, norms of vectors, and angles 
help in the design of learning algorithms. For instance, the Euclidean distance is 
commonly used, and it is defined as follows: 


$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{d} (x_{ki} - x_{kj})^2}. \tag{2.2}$$


If the feature space is binary, we often use the Hamming distance, which counts the 
positions in which the two vectors differ: 

$$d(x_i, x_j) = \sum_{k=1}^{d} (x_{ki} \oplus x_{kj}), \tag{2.3}$$

where $\oplus$ is the XOR operator. This distance is useful in efficiently retrieving elements 
from a quantum associative memory (Section 11.1). 

The cosine of the smallest angle between two vectors, also called the cosine 
similarity, is given as 


$$\cos(x_i, x_j) = \frac{x_i^T x_j}{\|x_i\| \|x_j\|}. \tag{2.4}$$

Other distance and similarity functions are of special importance in kernel-based 
learning methods (Chapter 7). 
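
As a minimal sketch, the three measures of Equations 2.2 to 2.4 can be written in a few lines of Python. The function names and the two binary example vectors are illustrative assumptions, not part of the original text.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two feature vectors (Equation 2.2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hamming(x, y):
    """Hamming distance between two binary vectors (Equation 2.3): XOR, then sum."""
    return sum(a ^ b for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between two nonzero vectors (Equation 2.4)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))

# Two made-up binary feature vectors with d = 4 features.
x_i = [1, 0, 1, 1]
x_j = [1, 1, 0, 1]

print(euclidean(x_i, x_j))          # square root of 2: two coordinates differ by 1
print(hamming(x_i, x_j))            # 2: the vectors differ in two positions
print(cosine_similarity(x_i, x_j))  # 2/3: dot product 2, both norms sqrt(3)
```

On binary vectors the squared Euclidean distance and the Hamming distance coincide, which is why the first value is the square root of the second.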

The probabilistic view introduces a different set of tools to help design algorithms. 
It assumes that each feature is a random variable, defined as a function that assigns 
a real number to every outcome of an experiment (Zaki and Meira, 2013, p. 17). A 
discrete random variable takes any of a specified finite or countable list of values. 
The associated probabilities form a probability mass function. A continuous random 
variable takes any numerical value in an interval or in a collection of intervals. In the 
continuous case, a probability density function describes the distribution. 

Irrespective of the type of random variable, the associated probabilities must sum, or 
in the continuous case integrate, to 1. In the geometric view, this corresponds to 
normalization constraints. 

Just as features are grouped into a feature vector in the geometric view, the probabilistic 
view has a multivariate random variable for each data instance: $(X_1, X_2, \ldots, X_d)$. 
A joint probability mass function or density function describes the distribution. The 
random variables are independent if and only if the joint probability decomposes to 
the product of the constituent distributions for every value of the range of the random 
variables: 

$$P(X_1, X_2, \ldots, X_d) = P(X_1) P(X_2) \cdots P(X_d). \tag{2.5}$$


This independence translates to the orthogonality of the basis vectors in the geometric 
view. 
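
The factorization in Equation 2.5 can be checked numerically in a small discrete case. The sketch below assumes two binary random variables whose joint probability mass function is given as a table; both tables and the function names are invented for illustration.

```python
def marginals(joint):
    """Row and column marginals of a joint probability mass function table."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return p_x, p_y

def is_independent(joint, tol=1e-12):
    """Check whether P(x, y) = P(x) P(y) holds for every cell of the table."""
    p_x, p_y = marginals(joint)
    return all(
        abs(joint[i][j] - p_x[i] * p_y[j]) <= tol
        for i in range(len(p_x))
        for j in range(len(p_y))
    )

# Built as the product of marginals (0.4, 0.6) and (0.3, 0.7), so it factorizes.
independent = [[0.12, 0.28], [0.18, 0.42]]
# The two variables always take the same value, so the joint does not factorize.
correlated = [[0.5, 0.0], [0.0, 0.5]]

print(is_independent(independent))  # True
print(is_independent(correlated))   # False
```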

Not all features are equally important in the feature space. Some might mirror 
the distribution of another one—strong correlations may exist among features, 
violating independence assumptions. Others may get consistently low weights or low 
probabilities to the extent that their presence is negligible. Having more features 
should result in more discriminating power and thus higher effectiveness in machine 
learning. However, practical experience with machine learning algorithms has shown 
that this is not always the case. 

Irrelevant or redundant training information adversely affects many common 
machine learning algorithms. For instance, the nearest neighbor algorithm is sensitive 
to irrelevant features. Its sample complexity—number of training examples needed 
to reach a given accuracy level—grows exponentially with the number of irrelevant 
features (Langley and Sage, 1994b). Sample complexity for decision tree algorithms 
grows exponentially for some concepts as well. Removing irrelevant and redundant 
information produces smaller decision trees (Kohavi and John, 1997). The naïve 
Bayes classifier is also affected by redundant features owing to its assumption that 
features are independent given the class label (Langley and Sage, 1994a). However, 
in the case of support vector machines, feature selection has a smaller impact on the 
efficiency (Weston et al., 2000). 

The removal of redundant features reduces the number of dimensions in the space, 
and may improve generalization performance (Section 2.4). The potential benefits 
of feature selection and feature extraction include facilitating data visualization and 
data understanding, reducing the measurement and storage requirements, reducing 
training and utilization times, and defying the curse of dimensionality to improve 
prediction performance (Guyon et al., 2003). Methods differ in which aspect they put 
more emphasis on. Getting the right number of features is a hard task. 

Feature selection and feature extraction are the two fundamental approaches in 
reducing the number of dimensions. Feature selection is the process of identifying 
and removing as much irrelevant and redundant information as possible. Feature 
extraction, on the other hand, creates a new, reduced set of features which combines 
elements of the original feature set. 

A feature selection algorithm employs an evaluation measure to score different 
subsets of the features. For instance, feature wrappers take a learning algorithm, and 
train it on the data using subsets of the feature space. The error rate will serve as 
an evaluation measure. Since feature wrappers train a model in every step, they are 
expensive to evaluate. Feature filters use more direct evaluation measures such as 
correlation or mutual information. Feature weighting is a subclass of feature filters. 
It does not reduce the actual dimension, but weights and ranks features according to 
their importance. 
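
A feature filter can be sketched as follows. The example assumes that instances are stored as rows (the transpose of a d × N layout), uses the absolute Pearson correlation with the label as the evaluation measure, and runs on a small invented data set; a wrapper would instead retrain a learner for every candidate subset.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equally long sequences; 0.0 if either is constant."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def filter_features(X, y, k):
    """Rank features by |correlation with the label| and keep the best k indices."""
    d = len(X[0])
    scores = [abs(pearson([row[i] for row in X], y)) for i in range(d)]
    ranked = sorted(range(d), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores

# Made-up data: feature 0 tracks the label, feature 1 is constant, feature 2 is weak.
X = [[0.1, 5.0, 1.0],
     [0.9, 5.0, 0.0],
     [0.2, 5.0, 1.0],
     [0.8, 5.0, 1.0]]
y = [0, 1, 0, 1]

selected, scores = filter_features(X, y, k=1)
print(selected)  # [0]
```

Returning the scores alongside the selection also covers feature weighting: the ranking is available without discarding any dimension.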

Feature extraction applies a transformation on the feature vector to perform 
dimensionality reduction. It often takes the form of a projection: principal component 
analysis and lower-rank approximation with singular value decomposition belong 
to this category. Nonlinear embeddings are also popular. The original feature set 
will not be present, and only derived features that are optimal according to some 
measure will be present—this task may be treated as an unsupervised learning scenario 
(Section 2.3). 
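
As an illustration of projection-based feature extraction, the following sketch performs principal component analysis through the singular value decomposition with NumPy. The synthetic data, the row-wise layout of instances, and the function name are assumptions made for the example.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X (N instances x d features) onto the top-k principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T, singular_values

# Synthetic data: three features that are noisy multiples of one hidden variable t.
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2.0 * t, -t]) + 0.01 * rng.normal(size=(200, 3))

Z, singular_values = pca_project(X, k=1)
print(Z.shape)           # (200, 1)
print(singular_values)   # the first value dominates the other two
```

None of the original three features survives in Z; the single derived feature is a linear combination of all of them.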


2.3 Supervised and Unsupervised Learning 


We often have a well-defined goal for learning. For instance, taking a time series, we 
want a learning algorithm to fit a nonlinear function to approximate the generating 
process. In other cases, the objective of learning is less obvious: there is a pattern 
we are seeking, but we are uncertain what it might be. Given a set of high- 
dimensional points, we may ask which points form nonoverlapping groups—clusters. 
The clusters and their labels are unknown before we begin. According to whether the 
goal is explicit, machine learning splits into two major paradigms: supervised and 
unsupervised learning. 

In supervised learning, each data point in a feature space comes with a label 
(Figure 2.1). The label is also called an output or a response, or, in classical statistical 
literature, a dependent variable. Labels may have a continuous numerical range, 
leading to a regression problem. In classification, the labels are the elements of a 
fixed, finite set of numerical values or qualitative descriptors. If the set has two 
values (for instance, yes or no, 0 or 1, +1 or −1), we call the problem binary 
classification. Multiclass problems have more than two labels. Qualitative labels are 
typically encoded as integers. 

A supervised learner predicts the label of instances after training on a sample of 
labeled examples, the training set. At a high level, supervised learning is about fitting a 
predefined multivariate function to a set of points. In other words, supervised learning 
is function approximation. 

Figure 2.1 Supervised learning. Given labeled training instances, the goal is to identify a 
decision surface that separates the classes. 

We denote a label by $y$. The training set is thus a collection of pairs of data points 
and corresponding labels: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $N$ is the 
number of training instances. 

In an unsupervised scenario, the labels are missing. A learning algorithm must 
extract structure in the data on its own (Figure 2.2). Clustering and low-dimensional 
embedding belong to this category. Clustering finds groups of data instances such 
that instances in the same group are more similar to each other than to those in other 
groups. The groups—or clusters—may be embedded in one another, and the density of 
data instances often varies across the feature space; thus, clustering is a hard problem 
to solve in general. 

Low-dimensional embedding involves projecting data instances from the high- 
dimensional feature space to a more manageable number of dimensions. The target 
number of dimensions depends on the task. It can be as high as 200 or 300. For 
example, if the feature space is sparse, but it has several million dimensions, it 
is advantageous to embed the points in 200 dimensions (Deerwester et al., 1990). 
If we project to just two or three dimensions, we can plot the data instances in 
the embedding space to reveal their topology. For this reason, a good embedding 
algorithm will preserve either the local topology or the global topology of the points 
in the original high-dimensional space. 

Semisupervised learning makes use of both labeled and unlabeled examples to 
build a model. Labels are often expensive to obtain, whereas data instances are 
available in abundance. The semisupervised approach learns the pattern using the 
labeled examples, then refines the decision boundary between the classes with the 
unlabeled examples. 

Figure 2.2 Unsupervised learning. The training instances do not have a label. The learning 
process identifies the classes automatically, often creating a decision boundary. 


Active learning is a variant of semisupervised learning in which the learning 
algorithm is able to solicit labels for problematic unlabeled instances from an 
appropriate information source—for instance, from a human annotator (Settles, 2009). 
Similarly to the semisupervised setting, there are some labels available, but most of 
the examples are unlabeled. The task in a learning iteration is to choose the optimal 
set of unlabeled examples for which the algorithm solicits labels. Following Settles 
(2009), these are some typical strategies to identify the set for labeling: 


+ Uncertainty sampling: the selected set corresponds to those data instances where the 
confidence is low. 

+ Query by committee: train a simple ensemble (Section 2.6) that casts votes on data instances, 
and select those which are most ambiguous. 

+ Expected model change: select those data instances that would change the current model the 
most if the learner knew their labels. This approach is particularly fruitful in 
gradient-descent-based models, where the expected change is easy to quantify by the length 
of the gradient. 

+ Expected error reduction: select those data instances where the model performs poorly, that 
is, where the generalization error (Section 2.4) is most likely to be reduced. 

+ Variance reduction: generalization performance is hard to measure, whereas minimizing 
output variance is far more feasible; select those data instances which minimize output 
variance. 

+ Density-weighted methods: the selected instances should be not only uncertain, but also 
representative of the underlying distribution. 


It is interesting to contrast these active learning strategies with the selection of optimal 
state in quantum process tomography (Section 13.6). 
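
The first strategy in the list, uncertainty sampling, is simple enough to sketch directly. The example assumes a binary classifier that outputs a probability for the positive class on each unlabeled instance, and it treats confidence as lowest when that probability is closest to 0.5; the pool of probabilities is invented.

```python
def uncertainty_sampling(probabilities, k):
    """Return the indices of the k pool instances whose P(y = 1 | x) is closest to 0.5."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]

# Hypothetical model outputs on five unlabeled instances.
pool_probabilities = [0.95, 0.52, 0.10, 0.45, 0.80]

query = uncertainty_sampling(pool_probabilities, k=2)
print(query)  # [1, 3]: the instances with probabilities 0.52 and 0.45
```

The other strategies change only the ranking key, for instance the disagreement among committee members or the norm of the expected gradient.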

One particular form of learning, transductive learning, will be relevant in 
later chapters, most notably in Chapter 13. The models mentioned so far are 
inductive: on the basis of data points, labeled or unlabeled, we infer a function 
that will be applied to unseen data points. Transduction avoids this inference 
to the more general case, and it infers from particular instances to particular 
instances (Figure 2.3) (Gammerman et al., 1998). This way, transduction asks 
for less: an inductive function implies a transductive one. Transduction is 
similar to instance-based learning, a family of algorithms that compares new 
problem instances with training instances; K-means clustering is an example 
(Section 5.3). If some labels are available, transductive learning is similar to 
semisupervised learning. Yet, transduction is different from all the learning approaches 
mentioned thus far. Instance-based learning can be inductive, and semisupervised learning 
is inductive, whereas transductive learning avoids inductive reasoning by definition. 

Figure 2.3 Transductive learning. A model is not inferred, and there are no decision 
surfaces. The label of training instances is propagated to the unlabeled instances, which 
are provided at the same time as the training instances. 


2.4 Generalization Performance 


If a learning algorithm learns to reproduce the labels of the training data with 
100% accuracy, it still does not follow that the learned model will be useful. What 
makes a good learner? A good algorithm will generalize well to previously unseen 
instances. This is why we start training an algorithm: it is hardly interesting to 
see labeled examples classified again. Generalization performance characterizes a 
learner’s prediction capability on independent test data. 

Consider a family of functions $f$ that approximate a function that generates the data 
$g(x) = y$ based on a sample $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$. The sample 
itself suffers from random noise with a zero mean and variance $\sigma^2$. 

We define a loss function L depending on the values y takes. If y is a continuous 
real number—that is, we have a regression problem—typical choices are the squared 
error 


$$L(y_i, f(x_i)) = (y_i - f(x_i))^2, \tag{2.6}$$


and the absolute error 


$$L(y_i, f(x_i)) = |y_i - f(x_i)|. \tag{2.7}$$

In the case of binary classes, the 0-1 loss function is defined as 

$$L(y_i, f(x_i)) = \mathbb{1}_{\{f(x_i) \neq y_i\}}, \tag{2.8}$$

where $\mathbb{1}$ is the indicator function. Optimizing for a classification problem with a 
0-1 loss function is an NP-hard problem even for such a relatively simple class of 
functions as linear classifiers (Feldman et al., 2012). The 0-1 loss is therefore often 
approximated by a convex function that makes optimization easier. The hinge loss, notable 
for its use by support vector machines, is one such approximation: 

$$L(y_i, f(x_i)) = \max(0, 1 - f(x_i) y_i). \tag{2.9}$$

Here $f : \mathbb{R}^d \to \mathbb{R}$; that is, the range of the function is not just $\{0, 1\}$. 
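
The four loss functions of Equations 2.6 to 2.9 translate directly into code. In this sketch the function names are illustrative, and the hinge loss assumes that labels are encoded as +1 or −1 while the model returns a real-valued score.

```python
def squared_error(y, f_x):
    """Squared error for regression (Equation 2.6)."""
    return (y - f_x) ** 2

def absolute_error(y, f_x):
    """Absolute error for regression (Equation 2.7)."""
    return abs(y - f_x)

def zero_one_loss(y, f_x):
    """0-1 loss for classification (Equation 2.8): 1 on a mistake, 0 otherwise."""
    return 1 if f_x != y else 0

def hinge_loss(y, f_x):
    """Hinge loss (Equation 2.9); assumes y in {-1, +1} and a real-valued score f_x."""
    return max(0.0, 1.0 - f_x * y)

print(squared_error(3.0, 2.5))   # 0.25
print(absolute_error(3.0, 2.5))  # 0.5
print(zero_one_loss(1, -1))      # 1
print(hinge_loss(+1, 2.5))       # 0.0: correct side, outside the margin
print(hinge_loss(-1, 2.5))       # 3.5: wrong side, penalized linearly
```

Unlike the 0-1 loss, the hinge loss also penalizes correct predictions whose score falls inside the margin, for example a score of 0.3 for a positive instance.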
Given a loss function, the training error (or empirical risk) is defined as 

$$E = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)). \tag{2.10}$$

Finding a model in a class of functions that minimizes this error is called empirical 
risk minimization. A model with zero training error, however, may simply overfit the 
training data and generalize poorly. Consider, for instance, the following function: 


$$f(x) = \begin{cases} y_i & \text{if } x = x_i, \\ 0 & \text{otherwise.} \end{cases} \tag{2.11}$$


This function is empirically optimal—the training error is zero. Yet, it is easy to see 
that this function is not what we are looking for. 
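
To make the point concrete, the sketch below builds the lookup function of Equation 2.11 and evaluates the empirical risk of Equation 2.10 on training and test data. The linear generating process, the sample points, and the function names are assumptions chosen only for the example.

```python
def squared_error(y, f_x):
    return (y - f_x) ** 2

def empirical_risk(loss, f, sample):
    """Average loss of f over a sample of (x, y) pairs (Equation 2.10)."""
    return sum(loss(y, f(x)) for x, y in sample) / len(sample)

def make_memorizer(training_set):
    """Lookup function of Equation 2.11: y_i if x equals some x_i, 0 otherwise."""
    table = dict(training_set)
    def f(x):
        return table.get(x, 0.0)
    return f

def generate(x):
    """Hypothetical noise-free generating process, assumed for this example."""
    return 2.0 * x + 1.0

training_set = [(x, generate(x)) for x in (0.0, 1.0, 2.0, 3.0)]
test_set = [(x, generate(x)) for x in (0.5, 1.5, 2.5)]

f = make_memorizer(training_set)
train_error = empirical_risk(squared_error, f, training_set)
test_error = empirical_risk(squared_error, f, test_set)
print(train_error)  # 0.0
print(test_error)   # about 18.67: the average of 4, 16, and 36
```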

Take a test sample x from the underlying distribution. Given the training set, the 
test error or generalization error is 


$$E_x(f) = L(x, f(x)). \tag{2.12}$$


The expectation value of the generalization error is the true error we are interested 
in: 


$$E_N(f) = E\left(E_x(f) \mid \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}\right). \tag{2.13}$$


We estimate the true error over test samples from the underlying distribution. 

Let us analyze the structure of the error further. The error over the distribution will 
be $E^* = E[L(x, f(x))] = \sigma^2$; this error is also called Bayes error. The best possible 
model of the family of functions $f$ will have an error that no longer depends on the 
training set: $E_{\mathrm{best}}(f) = \inf\{E[L(x, f(x))]\}$. 

The ultimate question is how close we can get with the family of functions to the 
Bayes error using the sample: 


$$E_N(f) - E^* = (E_N(f) - E_{\mathrm{best}}(f)) + (E_{\mathrm{best}}(f) - E^*). \tag{2.14}$$


The first part of the sum is the estimation error: $E_N(f) - E_{\mathrm{best}}(f)$. This is 
controlled and usually small. 

The second part is the approximation error or model bias: $E_{\mathrm{best}}(f) - E^*$. This is 
characteristic for the family of approximating functions chosen, and it is harder to 
control, and typically larger than the estimation error. 

The estimation error and model bias are intrinsically linked. The more complex we 
make the model f, the lower the bias is, but in exchange, the estimation error increases. 
This tradeoff is analyzed in Section 2.5. 
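
The link between complexity and the two error terms can be observed in a small experiment. The sketch below is an illustration under assumed settings: a sine-shaped generating function with Gaussian noise, a training set of 20 points, and polynomial families of degree 1, 3, and 9 fitted with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n noisy observations of an assumed generating function sin(pi x)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

def mean_squared_error(coefficients, x, y):
    return float(np.mean((np.polyval(coefficients, x) - y) ** 2))

x_train, y_train = sample(20)
x_test, y_test = sample(1000)

train_errors, test_errors = {}, {}
for degree in (1, 3, 9):
    # Least-squares fit within the family of polynomials of the given degree.
    coefficients = np.polyfit(x_train, y_train, degree)
    train_errors[degree] = mean_squared_error(coefficients, x_train, y_train)
    test_errors[degree] = mean_squared_error(coefficients, x_test, y_test)

for degree in (1, 3, 9):
    print(degree, train_errors[degree], test_errors[degree])
```

Because the polynomial families are nested, the training error cannot increase with the degree. The test error carries no such guarantee: a straight line is too rigid for the sine shape, whereas a high-degree polynomial fitted to 20 points starts to follow the noise.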


2.5 Model Complexity 


The complexity of the class of functions performing classification or regression and 
the algorithm’s generalizability are related. The Vapnik-Chervonenkis (VC) theory 
provides a general measure of complexity and proves bounds on errors as a function 
of complexity. Structural risk minimization is the minimization of these bounds, which 
depend on the empirical risk and the capacity of the function class (Vapnik, 1995). 

Consider a function $f$ with a parameter vector $\theta$: it shatters a set of data points 
$\{x_1, x_2, \ldots, x_N\}$ if, for all assignments of labels to those points, there exists a 
$\theta$ such that the function $f$ makes no errors when evaluating that set of data points. 
A set of $N$ points can be labeled in $2^N$ ways. A rich function class is able to realize 
all $2^N$ separations; that is, it shatters the $N$ points. 

The idea of VC dimensions lies at the core of the structural risk minimization 
theory: it measures the complexity of a class of functions. This is in stark contrast 
to the measures of generalization performance in Section 2.4, which are derived from 
the sample and the distribution. 

The VC dimension of a function $f$ is the maximum number of points that are 
shattered by $f$. In other words, the VC dimension of the function $f$ is $h'$, where $h'$ 
is the maximum $h$ such that some data point set of cardinality $h$ can be shattered by $f$. 
The VC dimension can be infinity (Figure 2.4). 


Figure 2.4 Examples of shattering sets of points. (a) A line on a plane can shatter a set of 
three points with arbitrary labels, but it cannot shatter any set of four points; hence, a line 
has a VC dimension of three. (b) A sine function can shatter any number of points with any 
assignment of labels; hence, its VC dimension is infinite. 
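
Shattering by a line can be verified by brute force for small point sets. The sketch below tries every labeling and tests linear separability with the perceptron rule; the point sets are invented, and the epoch cap is an assumption that is ample for such tiny inputs, since the perceptron makes only a bounded number of mistakes on separable data.

```python
from itertools import product

def linearly_separable(points, labels, max_epochs=1000):
    """Perceptron test: True if some line separates the +1 and -1 labeled points."""
    w = [0.0, 0.0, 0.0]  # two weights and an offset
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + w[2]) <= 0:
                # Misclassified (or on the line): move the line toward the point.
                w = [w[0] + y * x1, w[1] + y * x2, w[2] + y]
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def shatters(points):
    """True if a line realizes every one of the 2^N labelings of the points."""
    return all(linearly_separable(points, labels)
               for labels in product((-1, 1), repeat=len(points)))

three_points = [(0, 0), (1, 0), (0, 1)]
four_points = [(0, 0), (1, 0), (0, 1), (1, 1)]

print(shatters(three_points))  # True
print(shatters(four_points))   # False: the XOR labeling is not separable
```

The second result only shows that this particular set of four points is not shattered; the statement about the VC dimension requires that no set of four points can be.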
Vapnik’s theorem proves a connection between the VC dimension, empirical risk, 
and the generalization performance (Vapnik and Chervonenkis, 1971). For data that are 
drawn independent and identically distributed from the same distribution as the training 
set, the test error stays below an upper bound with probability at least $1 - \eta$: 

$$P\left(E_N(f) \le E + \sqrt{\frac{h\left(\log\frac{2N}{h} + 1\right) - \log\frac{\eta}{4}}{N}}\right) \ge 1 - \eta \tag{2.15}$$

if $h < N$, where $h$ is the VC dimension of the function, $N$ is the number of training 
instances, and $0 < \eta \le 1$. When $h \ll N$, the function 
class should be large enough to provide functions that are able to model the hidden 
dependencies in the joint distribution $P(x, y)$. 

This theorem formally binds model complexity and generalization performance. 
Empirical risk minimization—introduced in Section 2.4—allows us to pick an optimal 
model given a fixed VC dimension h for the function class. The principle that derives 
from Vapnik’s theorem—structural risk minimization—goes further. We optimize 
empirical risk for a nested sequence of increasingly complex models with VC 
dimensions $h_1 < h_2 < \cdots$, and select the model with the smallest value of the upper 
bound in Equation 2.15. 
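
Structural risk minimization can be sketched as a selection over candidate models. The code assumes the commonly stated form of Vapnik's bound, in which the confidence term is the square root of (h(log(2N/h) + 1) − log(η/4))/N; the VC dimensions and training errors of the candidates are hypothetical numbers.

```python
import math

def vc_bound(empirical_risk, h, n_samples, eta=0.05):
    """Upper bound on the test error, holding with probability at least 1 - eta."""
    if not 0 < h < n_samples:
        raise ValueError("the bound requires 0 < h < n_samples")
    confidence_term = (h * (math.log(2 * n_samples / h) + 1)
                       - math.log(eta / 4)) / n_samples
    return empirical_risk + math.sqrt(confidence_term)

# Hypothetical nested models: (VC dimension, training error).
# Richer classes fit the training data better but pay a larger confidence term.
candidates = [(5, 0.30), (20, 0.12), (100, 0.08), (400, 0.05)]
n_samples = 5000

bounds = {h: vc_bound(error, h, n_samples) for h, error in candidates}
best_h = min(bounds, key=bounds.get)
print(best_h)  # 20: the best balance of training error and capacity here
```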

The VC dimension is a one-number summary of the learning capacity of a class of 
functions, which may prove crude for certain classes (Schölkopf and Smola, 2001, 
p. 9). Moreover, the VC dimension is often difficult to calculate. Structural risk 
minimization successfully applies in some cases, such as in support vector machines 
(Chapter 7). 

A concept related to VC dimension is probably approximately correct (PAC) learning 
(Valiant, 1984). PAC learning stems from a different background: it introduces 
computational complexity to learning theory. Yet, the core principle is common. Given 
a finite sample, a learner has to choose a function from a given class such that, with 
high probability, the selected function will have low generalization error. A set of 
labels $y_i$ is PAC-learnable if there is an algorithm that can approximate the labels with 
a predefined error $0 < \epsilon < 1/2$ with a probability at least $1 - \delta$, where 
$0 < \delta < 1/2$ is also predefined. A problem is efficiently PAC-learnable if it is 
PAC-learnable by an algorithm that runs in time polynomial in $1/\epsilon$, $1/\delta$, and 
the dimension $d$ of the instances. Under some regularity conditions, a problem is 
PAC-learnable if and only if its VC dimension is finite (Blumer et al., 1989). 

An early result in quantum learning theory proved that all PAC-learnable function 
classes are learnable by a quantum model (Servedio and Gortler, 2001); in this 
sense, quantum and classical PAC learning are equivalent. The lower bound on the 
number of examples required for quantum PAC learning is close to the classical 
bound (Atici and Servedio, 2005). Certain classes of functions with noisy labels that 
are classically not PAC-learnable can be learned by a quantum model (Bshouty and 
Jackson, 1995). If we restrict our attention to transductive learning problems, and 
we do not want to generalize to a function that would apply to an arbitrary number 
of new instances, we can explicitly define a class of problems that would take an 
exponential amount of time to solve classically, but a quantum algorithm could learn it 
in polynomial time (Gavinsky, 2012). This approach does not fall in the bounded error 
quantum polynomial time class of decision problems, to which most known quantum 
algorithms belong (see Section 4.6). 

The connection between PAC-learning theory and machine learning is indirect, 
but explicit connection has been made to some learning algorithms, including neural 
networks (Haussler, 1992). This already suggests that quantum machine learning 
algorithms learn with a higher precision, even in the presence of noise. We give more 
specific details in Chapters 11 and 14. Here we point out that we do not deal with the 
exact identification of a function (Angluin, 1988), which also has various quantum 
formulations and accompanying literature. 

Irrespective of how we optimize the learning function, there is no free lunch: there 
cannot be a class of functions that is optimal for all learning problems (Wolpert and 
Macready, 1997). For any optimization or search algorithm, better performance in one 
class of problems is balanced by poorer performance in another class. For this reason 
alone, it is worth looking into combining different learning models. 


2.6 Ensembles 


A learning algorithm will always have strengths and weaknesses: a single model is 
unlikely to fit every possible scenario. Ensembles combine multiple models to achieve 
higher generalization performance than any of the constituent models is capable of. A 
constituent model is also called a base classifier or weak learner, and the composite 
model is called a strong learner. 

Apart from generalization performance, there are further reasons for using 
ensemble-based systems (Polikar, 2006): 


+ Large volumes of data: the computational complexity of many learning algorithms is much 
higher than linear time. Large data sets are often not feasible for training an algorithm. 
Splitting the data, training separate classifiers, and using an ensemble of them is often more 
efficient. 

+ Small volumes of data: ensembles help with the other extreme as well. By resampling with 
replacement, numerous classifiers learn on samples of the same data, yielding a higher 
performance. 

+ Divide and conquer: the decision boundary of problems is often a complex nonlinear surface. 
Instead of using an intricate algorithm to approximate the boundary, several simple learners 
might work just as efficiently. 

+ Data fusion: data often originate from a range of sources, leading to vastly different feature 
sets. Some learning algorithms work better with one type of feature set. Training separate 
algorithms on divisions of feature sets leads to data fusion, and efficient composite learners. 


Ensembles yield better results when there is considerable diversity among the base 
classifiers—irrespective of the measure of diversity (Kuncheva and Whitaker, 2003). 
If diversity is sufficient, base classifiers make different errors, and a strategic 
combination may reduce the total error, ideally improving generalization performance. 

The generic procedure of ensemble methods has two steps: first, develop a set of 
base classifiers from the training data; second, combine them to form a composite 
predictor. In a simple combination, the base learners vote, and the label prediction is 
based on the collection of votes. More involved methods weigh the votes of the base 
learners. 

More formally, we train $K$ base classifiers, $M_1, M_2, \ldots, M_K$. Each model is trained 
on a subset of $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$; the subsets may overlap in 
consecutive training runs. A base classifier should have higher accuracy than random guessing. 
The training of an $M_i$ classifier is independent from training of the other classifiers; 
hence, parallelization is easy and efficient (Han et al., 2012, p. 378). 

Popular ensemble methods include bagging, random forests, stacking, and boosting. 
In bagging (short for "bootstrap aggregating"), the base learners vote with equal 
weight (Breiman, 1996; Efron, 1979). To improve diversity among the learned models, 
bagging generates a random training subset from the data for each base classifier $M_i$. 
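
A minimal sketch of bagging follows. It assumes a deliberately simple base classifier (nearest class mean on a single numeric feature), bootstrap samples of the same size as the training set, and an invented data set; none of these choices is prescribed by the method.

```python
import random
from collections import Counter

def train_base(sample):
    """Hypothetical weak learner: predict the label whose class mean is nearest."""
    by_label = {}
    for x, y in sample:
        by_label.setdefault(y, []).append(x)
    means = {label: sum(xs) / len(xs) for label, xs in by_label.items()}
    def classify(x):
        return min(means, key=lambda label: abs(x - means[label]))
    return classify

def bagging(training_set, n_models, rng):
    """Train each base classifier on a bootstrap sample drawn with replacement."""
    models = []
    for _ in range(n_models):
        bootstrap = rng.choices(training_set, k=len(training_set))
        models.append(train_base(bootstrap))
    return models

def predict(models, x):
    """Majority vote: every base classifier votes with equal weight."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
training_set = [(0.1, 0), (0.4, 0), (0.5, 0), (1.6, 1), (1.9, 1), (2.2, 1)]
ensemble = bagging(training_set, n_models=25, rng=rng)

print(predict(ensemble, 0.0))  # 0
print(predict(ensemble, 2.5))  # 1
```

Since the base classifiers are trained independently, the loop in `bagging` could be distributed across workers without any coordination.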

Random forests are an application of bagging to decision trees (Breiman, 2001). 
Decision trees are simple base classifiers that are fast to train. Random forests train 
many decision trees on random samples of the data, keeping the complexity of each 
tree low. Bagging decides the eventual label on a data instance. Random forests are 
known to be robust to noise. 

Stacking is an improvement over bagging. Instead of counting votes, stacking trains 
a learner on the basis of the output of the base classifiers (Wolpert, 1992). For instance, 
suppose that the decision surface of a particular base classifier cannot fit a part of the 
data and it incorrectly learns a certain region of the feature space. Instances coming 
from that region will be consistently misclassified: the stacked learner may be able to 
learn this pattern, and correct the result. 

Unlike the previous methods, boosting does not train models in parallel: the base 
classifiers are trained in a sequence (Freund and Schapire, 1997; Schapire, 1990). Each 
subsequent base classifier is built to emphasize the training instances that previous 
learners misclassified. Boosting is a supervised search in the space of weak learners 
which may be regularized (see Chapters 9 and 14). 
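
The sequential reweighting can be sketched in the style of AdaBoost (Freund and Schapire, 1997). The example assumes threshold stumps on a single feature as weak learners, labels encoded as +1 or −1, and a made-up data set that no single stump classifies correctly.

```python
import math

def stump_predictions(xs, threshold, polarity):
    return [polarity if x >= threshold else -polarity for x in xs]

def best_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best threshold stump."""
    best = None
    for threshold in xs:
        for polarity in (1, -1):
            predictions = stump_predictions(xs, threshold, polarity)
            error = sum(w for w, p, y in zip(weights, predictions, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against division by zero
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified instances and lower the rest.
        predictions = stump_predictions(xs, threshold, polarity)
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, ys, predictions)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * (polarity if x >= threshold else -polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1, 1, -1, -1, 1, 1]  # no single threshold separates these labels

ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])  # [1, 1, -1, -1, 1, 1]
```

Each stump alone misclassifies at least two of the six instances, yet the weighted vote of three stumps reproduces all training labels.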


2.7 Data Dependencies and Computational Complexity 


We are looking for patterns in the data: to extract the patterns, we analyze relationships 
between instances. We are interested in how one instance relates to other instances. 
Yet, not every pair of instances is of importance. Which data dependencies should 
we look at? How do dependencies influence computational time? These questions are 
crucial to understand why certain algorithms are favored on contemporary hardware, 
and they are equally important to see how quantum computers reduce computational 
complexity. 

As a starting point, consider the trivial case: we compare every data instance with 
every other one. If the data instances are nodes in a graph, the dependencies form 
a complete graph $K_N$; this is an N : N dependency. This situation frequently occurs 
in learning algorithms. For instance, if we calculate a distance matrix, we will have 
this type of dependency. The kernel matrix of a support vector machine (Chapter 7) 
also exhibits N : N data dependency. In a distributed computing environment, N : N 
dependencies will lead to excess communication between the nodes, as data instances 
will be located in remote nodes, and their feature vectors or other description must be 
exchanged to establish the distance. 
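
The N : N pattern is easy to see in code. The sketch below computes a Euclidean distance matrix with a plain double loop; the three two-dimensional instances are invented, and symmetry is used to halve the number of comparisons without changing the quadratic growth.

```python
import math

def distance_matrix(instances):
    """Pairwise Euclidean distances: N(N - 1)/2 comparisons, each over d features."""
    n = len(instances)
    distances = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Every instance meets every other instance exactly once.
            distances[i][j] = distances[j][i] = math.dist(instances[i], instances[j])
    return distances

instances = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(instances)
for row in D:
    print(row)
# [0.0, 5.0, 10.0]
# [5.0, 0.0, 5.0]
# [10.0, 5.0, 0.0]
```

In a distributed setting, the inner call is where feature vectors stored on different nodes would have to be exchanged.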

Points that lie the furthest apart are not especially interesting to compare, but it is 
not immediately obvious which points lie close to one another in a high-dimensional 
space. Spatial data structures help in reducing the size of sets of data instances that 
are worth comparing. Building a tree-based spatial index often pays off. Examples 
include the R*-tree (Beckmann et al., 1990) or the X-tree (Berchtold et al., 1996) for 
data from a vector space, or the M-tree (Ciaccia et al., 1997) for data from a metric 
space. The height of such a tree-based index is O(log N) for a database of N objects 
in the worst case. Such structures not only reduce the necessary comparisons, but may 
also improve the performance of the learner, as in the case of clustering-based support 
vector machines (Section 7.9). 

In many learning algorithms, data instances are never compared directly. Neural 
networks, for example, adjust their weights as data instances arrive at the input 
nodes (Chapter 6). The weights act as proxies; they capture relations between 
instances without directly comparing them. If there are K weights in total in a given 
topology of the network, the dependency pattern will be N : K. If N > K, it becomes 
clear why there are theoretical computational advantages to such a scheme. Under 
the same assumption, parallel architectures easily accelerate actual computations 
(Section 10.2). 

Data dependencies constitute a large part of the computational complexity. If the 
data instances are regular dense vectors of d dimensions, calculating a distance matrix 
with N : N dependencies will require $O(N^2 d)$ time complexity. If we use a tree-based 
spatial index, the run time is reduced to $O(dN \log N)$. With access to quantum memory, 
this complexity reduces to $O(\log \mathrm{poly}(N))$, an exponential speedup over the classical 
case (Section 10.2). 
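
As a rough illustration of the classical N : N case, the following NumPy sketch computes a full Euclidean distance matrix. The function name `pairwise_distances` and the random data are assumptions made for this example only; the point is that every instance is paired with every other one, which is where the $O(N^2 d)$ cost comes from.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix for N instances of d dimensions.

    Every instance is compared with every other one (N : N dependency),
    so the work grows as O(N^2 d).
    """
    sq_norms = np.sum(X * X, axis=1)
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, evaluated for all pairs at once
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative values caused by floating-point rounding
    return np.sqrt(np.maximum(sq_dists, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # N = 5 instances, d = 3 features (illustrative)
D = pairwise_distances(X)     # symmetric N x N matrix with a zero diagonal
```

A spatial index would avoid filling in most of this matrix, at the price of building and maintaining the tree.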

If proxies are present to replace direct data dependencies, the time complexity will 
be in the range of $O(NK)$. The overhead of updating weights can outweigh the benefit 
of lower theoretical complexity. 

Learning is an iterative process; hence, eventual computational complexity will 
depend on the form of optimization performed and on the speed of convergence. A 
vast body of work is devoted to reformulating the form of optimization in learning 
algorithms—some are more efficient than others. Restricting the algorithm often 
yields reduced complexity. For instance, support vector machines with linear kernels 
can be trained in linear time (Joachims, 2006). 

Convergence is not always fast, and some algorithms never converge—in these 
cases, training stops after reaching appropriate conditions. The number of iterations is 
sometimes hard to predict. 

In the broader picture, learning a classifier with a nonconvex loss function is an NP-hard 
problem even for simple classes of functions (Feldman et al., 2012); this is the 
key reasoning behind using convex formulation for the optimization (Section 2.4). In 
some special cases, such as support vector machines, abandoning convexity pays off: direct 
optimization of a nonconvex objective function leads to higher accuracy and faster training 
(Collobert et al., 2006). 


Quantum Mechanics 


Quantum mechanics is a rich collection of theories that provide the most complete 
description of nature to date. Some aspects of it are notoriously hard to grasp, yet a tiny 
subset of concepts will be sufficient to understand the relationship between machine 
learning and quantum computing. This chapter collects these relevant concepts, and 
provides a brief introduction, but it deliberately omits important topics that are not 
crucial to understanding the rest of the book; for instance, we do not re-enumerate the 
postulates of quantum mechanics. 

The mathematical toolkit resembles that of machine learning, albeit the context is 
different. We will rely on linear algebra, and, to a much lesser extent, on multivariate 
calculus. Unfortunately, the notation used by physicists differs from that in other 
applications of linear algebra. We use the standard quantum mechanical conventions 
for the notation, while attempting to keep it in line with that used in the rest of 
the book. 

We start this chapter by introducing the fundamental concept of the superposition of 
states, which will be crucial for all algorithms discussed later (Section 3.1). We follow 
this with an alternative formulation for states by density matrices, which is often more 
convenient to use (Section 3.2). Another phenomenon, entanglement, shows stronger 
correlations than what classical systems can realize, and this is increasingly exploited 
in quantum computations (Section 3.3). 

The evolution of closed quantum systems is linear and reversible, which has 
repercussions for learning algorithms (Section 3.4). Measurement on a quantum 
system, on the other hand, is strictly nonreversible, which makes it possible to 
introduce nonlinearity in certain algorithms (Section 3.5). 

The uncertainty principle (Section 3.6) provides an explanation for quantum 
tunneling (Section 3.7), which in turn is useful in certain optimizations, particularly 
in ones that rely on the adiabatic theorem (Section 3.8). 

The last section in this chapter gives a simple explanation of why arbitrary 
quantum states cannot be cloned, which makes copying of quantum data impossible 
(Section 3.9). 

This chapter focuses on concepts that are common to quantum computing and 
derived learning algorithms. Additional concepts—such as representation theory— 
will be introduced in chapters where they are relevant. 


3.1 States and Superposition 


The state in quantum physics contains statistical information about a quantum system. 
Mathematically, it is represented by a vector—the state vector. A state is essentially a 
probability density; thus, it does not directly describe physical quantities such as mass 
or charge density. 

The state vector is an element of a Hilbert space. The choice of Hilbert space 
depends on the purpose, but in quantum information theory, it is most often $\mathbb{C}^n$. A 
vector has a special notation in quantum mechanics, the Dirac notation. A vector, 
also called a ket, is denoted by 

$$|\psi\rangle, \tag{3.1}$$

where $\psi$ is just a label. This label is as arbitrary as the name of a vector variable in 
other applications of linear algebra; for instance, the $x_i$ data instances in Chapter 2 
could be denoted by any other character. 

The ket notation abstracts the vector space: it no longer matters whether it is 
a finite-dimensional complex space or the infinite-dimensional space of Lebesgue 
square-integrable functions. When the ket is in finite dimensions, it is a column vector. 

Since the state vectors are related to probabilities, some form of normalization must 
be imposed on the vectors. In a general Hilbert space setting, we require the norm of 
the state vectors to equal 1: 

$$\| |\psi\rangle \| = 1. \tag{3.2}$$

While it appears logical to denote the zero vector by $|0\rangle$, this notation is reserved 
for a vector in the computational basis (Section 4.1). The null vector will be denoted 
by 0. 

The dual of a ket is a bra. Mathematically, a bra is the conjugate transpose of a ket: 

$$\langle\psi| = |\psi\rangle^\dagger. \tag{3.3}$$

If the Hilbert space is a finite-dimensional real or complex space, a bra corresponds 
to a row vector. With this notation, an inner product between two states $|\phi\rangle$ and $|\psi\rangle$ 
becomes 

$$\langle\phi|\psi\rangle. \tag{3.4}$$


If we choose a basis $\{|k_i\rangle\}$ in the Hilbert space of the quantum system, then a state 
vector $|\psi\rangle$ expands as the linear combination of the basis vectors: 

$$|\psi\rangle = \sum_i a_i |k_i\rangle, \tag{3.5}$$

where the $a_i$ coefficients are complex numbers, and the sum may be infinite, 
depending on the dimensions of the Hilbert space. The $a_i$ coefficients are called 
probability amplitudes, and the normalization constraint on the state vector implies 
that 

$$\sum_i |a_i|^2 = 1. \tag{3.6}$$

The sum in Equation 3.5 is called a quantum superposition of the states $|k_i\rangle$. Any 
sum of state vectors is a superposition, subject to renormalization. 

The superposition of a quantum system expresses that the system exists in all of 
its theoretically possible states simultaneously. When a measurement is performed, 
however, only one result is obtained, with a probability given by the squared modulus 
of the coefficient of the corresponding vector in the linear combination (Section 3.5). 
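
The normalization constraint and the role of the probability amplitudes can be checked numerically. The sketch below assumes a three-dimensional space and arbitrary illustrative amplitudes; the variable names are not from the text.

```python
import numpy as np

# Unnormalized amplitudes a_i in a three-dimensional basis (illustrative values)
amplitudes = np.array([1.0 + 1.0j, 2.0, -1.0j])

# Renormalize so that the state vector has unit norm (Equation 3.2)
psi = amplitudes / np.linalg.norm(amplitudes)

# In finite dimensions the bra is the conjugate transpose of the ket (Equation 3.3)
bra = psi.conj()

# The squared moduli of the probability amplitudes sum to 1 (Equation 3.6)
probabilities = np.abs(psi) ** 2
```

Here the squared moduli of the unnormalized amplitudes are 2, 4, and 1, so the three outcomes would occur with probabilities 2/7, 4/7, and 1/7.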


3.2 Density Matrix Representation and Mixed States 


An alternative representation of states is by density matrices. They are also called 
density operators; we use the two terms interchangeably. The density matrix is an 
operator formed by the outer product of a state vector with itself: 

$$\rho = |\psi\rangle\langle\psi|. \tag{3.7}$$

A state $\rho$ that can be written in this form is called a pure state. The state vector might 
be in a superposition, but the corresponding density matrix will still describe a pure 
state. 

Since quantum physics is quintessentially probabilistic, it is advantageous to think 
of a pure state as a pure ensemble, a collection of identical particles with the same 
physical configuration. A pure ensemble is described by one state function $\psi$ for all 
its particles. The following properties hold for pure states: 


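• A density matrix is idempotent: $\rho^2 = |\psi\rangle\langle\psi|\psi\rangle\langle\psi| = |\psi\rangle\langle\psi| = \rho$. 
• Given any orthonormal basis $\{|n\rangle\}$, the trace of a density matrix is 1: $\mathrm{tr}(\rho) = \sum_n \langle n|\rho|n\rangle = \sum_n \langle n|\psi\rangle\langle\psi|n\rangle = \sum_n \langle\psi|n\rangle\langle n|\psi\rangle = 1$. 
• Similarly, $\mathrm{tr}(\rho^2) = 1$. 
• Hermiticity: $\rho^\dagger = (|\psi\rangle\langle\psi|)^\dagger = |\psi\rangle\langle\psi| = \rho$. 
• Positive semidefinite: $\langle\phi|\rho|\phi\rangle = \langle\phi|\psi\rangle\langle\psi|\phi\rangle = |\langle\phi|\psi\rangle|^2 \ge 0$. 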
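
The properties listed above are easy to verify for a concrete state. The following is a minimal sketch for one arbitrarily chosen normalized ket; the numbers are illustrative assumptions.

```python
import numpy as np

psi = np.array([3.0, 4.0j]) / 5.0          # normalized ket (illustrative)
rho = np.outer(psi, psi.conj())            # rho = |psi><psi|

is_idempotent = np.allclose(rho @ rho, rho)
trace = np.trace(rho).real
is_hermitian = np.allclose(rho, rho.conj().T)
eigenvalues = np.linalg.eigvalsh(rho)      # nonnegative for a positive semidefinite matrix
```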


Density matrices allow for states of another type, mixed states. A mixed state is a 
mixture of projectors onto pure states: 

$$\rho_{\mathrm{mixed}} = \sum_i p_i |\psi_i\rangle\langle\psi_i|. \tag{3.8}$$


Taking a statistical interpretation again, a mixed state consists of identical particles, 
but portions are in different physical configurations. A mixture is described by a set 
of states $\psi_i$ with corresponding probabilities. This justifies the name density matrix: 
a mixed state is a distribution over pure states. The properties of a mixed state are as 
follows: 

• Idempotency is violated: $\rho_{\mathrm{mixed}}^2 = \left(\sum_i p_i |\psi_i\rangle\langle\psi_i|\right)\left(\sum_j p_j |\psi_j\rangle\langle\psi_j|\right) = \sum_{i,j} p_i p_j |\psi_i\rangle\langle\psi_i|\psi_j\rangle\langle\psi_j| \ne \rho_{\mathrm{mixed}}$. 
• $\mathrm{tr}(\rho_{\mathrm{mixed}}) = 1$. 
• However, $\mathrm{tr}(\rho_{\mathrm{mixed}}^2) < 1$. 
• Hermiticity. 
• Positive semidefinite. 


We do not normally denote mixed states with a lower index as above; instead, we write 
p for both mixed and pure states. 

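To highlight the distinction between superposition and mixed states, fix a basis 
$\{|0\rangle, |1\rangle\}$. A superposition in this two-dimensional space is a sum of two vectors: 

$$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle, \tag{3.9}$$

where $|\alpha|^2 + |\beta|^2 = 1$. The corresponding density matrix is 

$$\rho = \begin{pmatrix} |\alpha|^2 & \alpha\beta^* \\ \alpha^*\beta & |\beta|^2 \end{pmatrix}, \tag{3.10}$$

where $*$ stands for complex conjugation. 

A mixed state is, on the other hand, a sum of projectors: 

$$\rho_{\mathrm{mixed}} = |\alpha|^2 |0\rangle\langle 0| + |\beta|^2 |1\rangle\langle 1| = \begin{pmatrix} |\alpha|^2 & 0 \\ 0 & |\beta|^2 \end{pmatrix}. \tag{3.11}$$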


Interference terms—the off-diagonal elements—are present in the density matrix 
of a pure state (Equation 3.10), but they are absent in a mixed state (Equation 3.11). 
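
The difference between the two matrices can be made concrete with equal weights. This sketch assumes $\alpha = \beta = 1/\sqrt{2}$, a choice made only for illustration, and compares the off-diagonal elements and the purity $\mathrm{tr}(\rho^2)$ of both states.

```python
import numpy as np

alpha, beta = 1 / np.sqrt(2), 1 / np.sqrt(2)   # illustrative amplitudes
ket0 = np.array([1.0, 0.0])
ket1 = np.array([0.0, 1.0])

# Pure state: density matrix of the superposition alpha|0> + beta|1>
psi = alpha * ket0 + beta * ket1
rho_pure = np.outer(psi, psi.conj())

# Mixed state: weighted sum of projectors with the same diagonal
rho_mixed = (abs(alpha) ** 2 * np.outer(ket0, ket0)
             + abs(beta) ** 2 * np.outer(ket1, ket1))

purity_pure = np.trace(rho_pure @ rho_pure).real
purity_mixed = np.trace(rho_mixed @ rho_mixed).real
```

Both matrices have the same diagonal, so measurements in the chosen basis cannot tell them apart; only the interference terms and the purity differ.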

A density matrix is basis-dependent, but its trace is invariant with respect to a 
transformation of basis. 

The density matrix of a state is not unique. Different superpositions may have the 
same density matrix: 


1 1 1 
= — |1 .12 
Ii) at = (0 | »). (3.12) 
ly2) = V3 iq) 4 hy Gi : jos i= z )11) (3.13) 
OEA DP ae BT T ' 
3 _l 
p =|) al = (4 ) = |W2) (val = p2. (3.14) 
4 4 


An infinite number of ensembles can generate the same density matrix. Absorb the 
probability of a state vector in the vector itself: $|\tilde{\psi}_i\rangle = \sqrt{p_i}|\psi_i\rangle$. Then the following is 
true: 

Theorem 3.1. The sets $\{|\tilde{\psi}_i\rangle\}$ and $\{|\tilde{\phi}_j\rangle\}$ will have the same density matrix if and 
only if $|\tilde{\psi}_i\rangle = \sum_j u_{ij}|\tilde{\phi}_j\rangle$, where the $u_{ij}$ elements form a unitary transformation. 

While there is a clear loss of information by not having a one-to-one correspondence 
with state vectors, density matrices provide an elegant description of 
probabilities, and they are often preferred over the state vector formalism. 


3.3 Composite Systems and Entanglement 


Not every collection of particles is a pure state or a mixed state. Composite quantum 
systems are made up of two or more distinct physical systems. Unlike in classical 
physics, particles can become coupled or entangled, making the composite system 
more than the sum of the components. 

The state space of a composite system is the tensor product of the state spaces of the 
component physical systems. For instance, for two components A and B, the total 
Hilbert space of the composite system becomes $H_{AB} = H_A \otimes H_B$. A product state vector 
on the composite space is written as $|\psi\rangle_{AB} = |\psi\rangle_A \otimes |\psi\rangle_B$. The tensor product is 
often abbreviated as $|\psi\rangle_A|\psi\rangle_B$, or, equivalently, the labels are written in the same ket 
$|\psi_A\psi_B\rangle$. 

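As an example, assume that the component spaces are two-dimensional, and choose 
a basis in each. Then, a tensor product of two states yields the following composite 
state: 

$$\begin{pmatrix} \alpha \\ \beta \end{pmatrix}_A \otimes \begin{pmatrix} \gamma \\ \delta \end{pmatrix}_B = \begin{pmatrix} \alpha\gamma \\ \alpha\delta \\ \beta\gamma \\ \beta\delta \end{pmatrix}_{AB}. \tag{3.15}$$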
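
In a fixed basis, the tensor product of two state vectors is the Kronecker product of their coordinate vectors. A short sketch with illustrative component states (the numbers are assumptions):

```python
import numpy as np

psi_A = np.array([0.6, 0.8])                 # (alpha, beta), illustrative
psi_B = np.array([1.0, 1.0]) / np.sqrt(2)    # (gamma, delta), illustrative

# Kronecker product: (alpha*gamma, alpha*delta, beta*gamma, beta*delta)
psi_AB = np.kron(psi_A, psi_B)
```

The composite vector stays normalized because the norms of the components multiply.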


The Schmidt decomposition provides a way of expanding a vector on a product space. 
The following is true: 

Theorem 3.2. Let $H_A$ and $H_B$ have orthonormal bases $\{e_1, e_2, \ldots, e_n\}$ and 
$\{f_1, f_2, \ldots, f_m\}$, respectively. Then, any bipartite state $|\psi\rangle$ on $H_A \otimes H_B$ can be written 
as 

$$|\psi\rangle = \sum_{k=1}^{r} a_k |e_k\rangle \otimes |f_k\rangle, \tag{3.16}$$

where r is the Schmidt rank. 

This decomposition resembles singular value decomposition. 
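
The resemblance can be exploited directly: reshaping the amplitudes of a bipartite pure state into a matrix and taking its singular values yields the Schmidt coefficients. The helper `schmidt_coefficients` below is a hypothetical name introduced for this sketch, and the tolerance is an arbitrary choice.

```python
import numpy as np

def schmidt_coefficients(psi, dim_A, dim_B, tol=1e-12):
    """Schmidt coefficients of a bipartite pure state.

    The amplitudes are reshaped into a dim_A x dim_B matrix whose
    nonzero singular values are the Schmidt coefficients.
    """
    singular_values = np.linalg.svd(psi.reshape(dim_A, dim_B), compute_uv=False)
    return singular_values[singular_values > tol]

# |0> (x) (|0> + |1>)/sqrt(2): a product state
product = np.kron([1.0, 0.0], [1.0, 1.0]) / np.sqrt(2)
# (|00> + |11>)/sqrt(2): a Bell state
bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)

rank_product = len(schmidt_coefficients(product, 2, 2))
rank_bell = len(schmidt_coefficients(bell, 2, 2))
```

The product state has a single nonzero coefficient, whereas the Bell state has two equal ones.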

The density matrix representation is useful for the description of individual 
subsystems of a composite quantum system. The density matrix of the composite 
system is provided by a tensor product: 

$$\rho_{AB} = \rho_A \otimes \rho_B. \tag{3.17}$$

We can recover the components $\rho_A$ and $\rho_B$ from the composite density matrix $\rho_{AB}$. A 
subsystem of a composite quantum system is described by a reduced density matrix. 
The reduced density matrix for a subsystem A is defined by 

$$\rho_A = \mathrm{tr}_B(\rho_{AB}), \tag{3.18}$$

where $\mathrm{tr}_B$ is a partial trace operator over system B. In taking the partial trace, the 
probability amplitudes belonging to system B vanish; this is due to $\mathrm{tr}(\rho) = 1$ for 
any density matrix. This procedure is also called “tracing out.” Only the amplitudes 
belonging to system A remain. 

Density matrices and the partial trace operator allow us to find the rank of a Schmidt 
decomposition. Take an orthonormal basis $\{|f_k\rangle\}$ in system B. Then, the reduced 
density matrix $\rho_A$ is 

$$\rho_A = \mathrm{tr}_B(\rho_{AB}) = \sum_k \langle f_k|\rho_{AB}|f_k\rangle. \tag{3.19}$$

On the other hand, we can write 

$$\rho_A = \sum_i \langle f_i| \left(\sum_{k=1}^{r} a_k |e_k\rangle|f_k\rangle\right) \left(\sum_{l=1}^{r} a_l^* \langle e_l|\langle f_l|\right) |f_i\rangle = \sum_{k=1}^{r} |a_k|^2 |e_k\rangle\langle e_k|. \tag{3.20}$$

Hence, we get $\mathrm{rank}(\rho_A)$ = Schmidt rank of $\rho_{AB}$. 
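Let us study state vectors on the Hilbert space $H_{AB}$. For example, given a basis 
$\{|0\rangle, |1\rangle\}$, the most general pure state is given as 

$$|\psi\rangle = \alpha|00\rangle + \beta|01\rangle + \gamma|10\rangle + \delta|11\rangle. \tag{3.21}$$

Take as an example a Bell state, defined as $|\phi^+\rangle = \frac{|00\rangle + |11\rangle}{\sqrt{2}}$ (Section 4.1). This 
state cannot be written as a product of two states. 

Suppose there are states $\alpha|0\rangle + \beta|1\rangle$ and $\gamma|0\rangle + \delta|1\rangle$ such that 

$$\frac{|00\rangle + |11\rangle}{\sqrt{2}} = (\alpha|0\rangle + \beta|1\rangle) \otimes (\gamma|0\rangle + \delta|1\rangle). \tag{3.22}$$

Then, 

$$(\alpha|0\rangle + \beta|1\rangle) \otimes (\gamma|0\rangle + \delta|1\rangle) = \alpha\gamma|00\rangle + \alpha\delta|01\rangle + \beta\gamma|10\rangle + \beta\delta|11\rangle. \tag{3.23}$$

That is, we need to find a solution to the equations 

$$\alpha\gamma = \frac{1}{\sqrt{2}}, \quad \alpha\delta = 0, \quad \beta\gamma = 0, \quad \beta\delta = \frac{1}{\sqrt{2}}. \tag{3.24}$$

This system of equations does not have a solution: $\alpha\delta = 0$ forces $\alpha = 0$ or $\delta = 0$, 
which contradicts $\alpha\gamma \ne 0$ or $\beta\delta \ne 0$, respectively. Therefore, $\frac{|00\rangle + |11\rangle}{\sqrt{2}}$ cannot be a 
product state. 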

Composite states that can be written as a product state are called separable, whereas 
other composite states are entangled. 

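Density matrices reveal information about entangled states. This Bell state has the 
density operator 

$$\rho_{AB} = \frac{|00\rangle + |11\rangle}{\sqrt{2}} \, \frac{\langle 00| + \langle 11|}{\sqrt{2}} \tag{3.25}$$

$$= \frac{|00\rangle\langle 00| + |11\rangle\langle 00| + |00\rangle\langle 11| + |11\rangle\langle 11|}{2}. \tag{3.26}$$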
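Tracing it out in system B, we get 

$$\rho_A = \mathrm{tr}_B(\rho_{AB}) \tag{3.27}$$

$$= \frac{\mathrm{tr}_B(|00\rangle\langle 00|) + \mathrm{tr}_B(|11\rangle\langle 00|) + \mathrm{tr}_B(|00\rangle\langle 11|) + \mathrm{tr}_B(|11\rangle\langle 11|)}{2} \tag{3.28}$$

$$= \frac{|0\rangle\langle 0|\langle 0|0\rangle + |1\rangle\langle 1|\langle 1|1\rangle}{2} \tag{3.29}$$

$$= \frac{|0\rangle\langle 0| + |1\rangle\langle 1|}{2} = \frac{I}{2}. \tag{3.30}$$
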
The trace of the square of this matrix is 1/2, which is less than 1; therefore, it is a mixed 
state. The entangled state is pure in the composite space, but surprisingly, its partial 
traces are mixed states. From the Schmidt decomposition, we see that a state is entangled 
if and only if its Schmidt rank is strictly greater than 1. Therefore, a bipartite pure state 
is entangled if and only if its reduced states are mixed states. 
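
The partial trace of the Bell state can be reproduced numerically. In the sketch below, `partial_trace_B` is a hypothetical helper name, and the density matrix is assumed to be stored with the index of system A as the slower-varying one, which is the ordering produced by `np.kron`.

```python
import numpy as np

def partial_trace_B(rho_AB, dim_A, dim_B):
    """Trace out subsystem B of a density matrix on H_A (x) H_B."""
    # Index order after reshaping: (a, b, a', b'); summing over b = b' traces out B
    return np.einsum('abcb->ac', rho_AB.reshape(dim_A, dim_B, dim_A, dim_B))

bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)   # (|00> + |11>)/sqrt(2)
rho_AB = np.outer(bell, bell.conj())
rho_A = partial_trace_B(rho_AB, 2, 2)

purity_AB = np.trace(rho_AB @ rho_AB).real   # purity of the composite state
purity_A = np.trace(rho_A @ rho_A).real      # purity of the reduced state
```

The reduced state still has unit trace, but its purity drops to 1/2, which marks it as mixed.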

The reverse process is called purification: given a mixed state, we are interested 
in finding a pure state that gives the mixed state as its reduced density matrix. The 
following theorem holds: 

Theorem 3.3. Let $\rho_A$ be a density matrix acting on a Hilbert space $H_A$ of finite 
dimension n. Then, there exists a Hilbert space $H_B$ and a pure state $|\psi\rangle \in H_A \otimes H_B$ 
such that the partial trace of $|\psi\rangle\langle\psi|$ with respect to $H_B$ equals $\rho_A$: 

$$\mathrm{tr}_B(|\psi\rangle\langle\psi|) = \rho_A. \tag{3.31}$$

The pure state $|\psi\rangle$ is the purification of $\rho_A$. 

The purification is not unique; there are many pure states that reduce to the same 
density matrix. We call two states maximally entangled if the reduced density matrix 
is diagonal with equal probabilities as entries. 

Density matrices are able to prove the presence of entanglement in other forms. 
The Peres-Horodecki criterion is a necessary condition for the density matrix of a 
composite system to be separable. For composite systems of dimensions 2 × 2 and 2 × 3, 
it is also a sufficient condition (Horodecki et al., 1996). It is useful for mixed states, 
where the Schmidt decomposition does not apply. 

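Assume a general state $\rho_{AB}$ acts on a composite Hilbert space: 

$$\rho_{AB} = \sum_{ijkl} p^{ij}_{kl} |i\rangle\langle j| \otimes |k\rangle\langle l|. \tag{3.32}$$

Its partial transpose with respect to system B is defined as 

$$\rho_{AB}^{T_B} = \sum_{ijkl} p^{ij}_{kl} |i\rangle\langle j| \otimes (|k\rangle\langle l|)^T = \sum_{ijkl} p^{ij}_{kl} |i\rangle\langle j| \otimes |l\rangle\langle k|. \tag{3.33}$$

If $\rho_{AB}$ is separable, $\rho_{AB}^{T_B}$ has nonnegative eigenvalues. 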
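
A small numerical check of the criterion, under the assumption that density matrices are stored with the index of system A as the slower-varying one; `partial_transpose_B` is a hypothetical helper name. A negative eigenvalue of the partial transpose signals entanglement.

```python
import numpy as np

def partial_transpose_B(rho_AB, dim_A, dim_B):
    """Transpose only the indices belonging to subsystem B."""
    reshaped = rho_AB.reshape(dim_A, dim_B, dim_A, dim_B)
    # Swap b and b' while leaving a and a' untouched
    return reshaped.transpose(0, 3, 2, 1).reshape(dim_A * dim_B, dim_A * dim_B)

bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
rho_entangled = np.outer(bell, bell)
# Product of a maximally mixed qubit and |0><0|: separable by construction
rho_separable = np.kron(np.diag([0.5, 0.5]), np.diag([1.0, 0.0]))

min_eig_entangled = np.linalg.eigvalsh(partial_transpose_B(rho_entangled, 2, 2)).min()
min_eig_separable = np.linalg.eigvalsh(partial_transpose_B(rho_separable, 2, 2)).min()
```

For the Bell state the smallest eigenvalue of the partial transpose is -1/2, whereas the separable example keeps all eigenvalues nonnegative.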

Quantum entanglement has been experimentally verified (Aspect et al., 1982); it is 
not just an abstract mathematical concept, it is an aspect of reality. Entanglement is a 
correlation between two systems that is stronger than what classical systems are able 
to produce. A local hidden variable theory is one in which distant events do not have 
an instantaneous effect on local ones—seemingly instantaneous events can always be 
explained by hidden variables in the system. Entanglement may produce instantaneous 
correlations between remote systems which cannot be explained by local hidden 
variable theories; this phenomenon is called nonlocality. Classical systems cannot 
produce nonlocal phenomena. 

Bell’s theorem draws an important line between quantum and classical correlations 
of composite systems (Bell, 1964). The limit is easy to test when given in the 
following inequality (the Clauser-Horne-Shimony-Holt inequality; Clauser et al., 
1969): 

$$C[A(a), B(b)] + C[A(a), B(b')] + C[A(a'), B(b)] - C[A(a'), B(b')] \le 2, \tag{3.34}$$


where a and a’ are detector settings on side A of the composite system, b and b’ 
are detector settings on side B, and C denotes correlation. This is a sharp limit: any 
correlation violating this inequality is nonlocal. 
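
The violation can be illustrated on the Bell state $(|00\rangle + |11\rangle)/\sqrt{2}$. The sketch below assumes that each detector setting is an angle in the X-Z plane and that the correlation is the expectation value of the product of the two local observables; the specific angles are a standard choice, not one taken from the text.

```python
import numpy as np

Z = np.array([[1.0, 0.0], [0.0, -1.0]])
X = np.array([[0.0, 1.0], [1.0, 0.0]])

def observable(theta):
    """Two-outcome observable along an axis at angle theta in the X-Z plane."""
    return np.cos(theta) * Z + np.sin(theta) * X

def correlation(state, theta_A, theta_B):
    """Expectation value of the product of the two local outcomes."""
    joint = np.kron(observable(theta_A), observable(theta_B))
    return float(state @ joint @ state)

bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
a, a_prime = 0.0, np.pi / 2            # detector settings on side A
b, b_prime = np.pi / 4, -np.pi / 4     # detector settings on side B

chsh = (correlation(bell, a, b) + correlation(bell, a, b_prime)
        + correlation(bell, a_prime, b) - correlation(bell, a_prime, b_prime))
```

With these settings the left-hand side of Equation 3.34 evaluates to $2\sqrt{2}$, which exceeds the classical bound of 2.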

Entanglement and nonlocality are not the same, however. Entanglement is a 
necessary condition for nonlocality, but more entanglement does not mean more 
nonlocality (Vidick and Wehner, 2011). Nonlocality is a more generic term: “There 
exist in nature channels connecting two (or more) distant partners, that can distribute 
correlations which can neither be caused by the exchange of a signal (the channel 
does not allow signalling, and moreover, a hypothetical signal should travel faster 
than light), nor be due to predetermined agreement...” (Scarani, 2006). 

Entanglement is a powerful resource that is often exploited in quantum computing 
and quantum information theory. This is reflected by the cost of simulating entangle- 
ment by classical composite systems: exponentially more communication is necessary 
between the component systems (Brassard et al., 1999). 


3.4 Evolution 


Unobserved, a quantum mechanical system evolves continuously and deterministically. 
This is in sharp contrast with the unpredictable jumps that occur during a 
measurement (Section 3.5). The evolution is described by the Schrödinger equation. 
In its most general form, the Schrödinger equation reads as follows: 

$$i\hbar \frac{\partial}{\partial t}|\psi\rangle = H|\psi\rangle, \tag{3.35}$$


where H is the Hamiltonian operator, and $\hbar$ is the reduced Planck constant; its actual 
value is not important to us. The Hamiltonian characterizes the total energy of a system and 
takes different forms depending on the situation. 

In this context, the $|\psi\rangle$ state vector is also called the wave function of the quantum 
system. The wave function nomenclature justifies the abstraction level of the bra-ket 
notation: as mentioned in Section 3.1, a ket is simply a vector in a Hilbert space. If we 
think about the state as a wave function, this often implies that it is an actual function, 
an element of the infinite-dimensional Hilbert space of Lebesgue square-integrable 
functions. We, however, almost always use a finite-dimensional complex vector space 
as the underlying Hilbert space. A notable exception is quantum tunneling, where 
the wave function has additional explanatory meaning (Section 3.7). In turn, quantum 
annealing relies on quantum tunneling (Section 14.1); hence, it is worth taking note 
of the function space interpretation. 

An equivalent way of writing the Schrödinger equation is with density matrices: 

$$i\hbar \frac{\partial \rho}{\partial t} = [H, \rho], \tag{3.36}$$

where $[\cdot, \cdot]$ is the commutator: $[H, \rho] = H\rho - \rho H$. 

The Hamiltonian is a Hermitian operator; therefore, it has a spectral decomposition. 
If the Hamiltonian is independent of time, the following equation gives the time-independent 
Schrödinger equation for the state vector: 

$$E|\psi\rangle = H|\psi\rangle, \tag{3.37}$$

where E is the energy of the state, which is an eigenvalue of the Hamiltonian. Solving 
this equation yields the stationary states for a system; these are also called energy 
eigenstates. If we understand these states, solving the time-dependent Schrödinger 
equation becomes easier for any other state. The smallest eigenvalue is called the 
ground-state energy, which has a special role in many applications, including adiabatic 
quantum computing, where an adiabatic change of the ground state will yield the 
optimum of a function being studied (Section 3.8 and Chapter 14). An excited state is 
any state with energy greater than the ground state. 

Consider an eigenstate $\psi_a$ of the Hamiltonian: $H\psi_a = E_a\psi_a$. Taking the Taylor 
expansion of the exponential, we observe how the time evolution operator acts on this 
eigenstate: 

$$e^{-iHt/\hbar}|\psi_a(0)\rangle \tag{3.38}$$

$$= \sum_n \frac{1}{n!}\left(\frac{H}{i\hbar}\right)^n t^n |\psi_a(0)\rangle \tag{3.39}$$

$$= \sum_n \frac{1}{n!}\left(\frac{E_a}{i\hbar}\right)^n t^n |\psi_a(0)\rangle \tag{3.40}$$

$$= e^{-iE_a t/\hbar}|\psi_a(0)\rangle. \tag{3.41}$$

We define $U(H, t) = e^{-iHt/\hbar}$. This is the time evolution operator of a closed 
quantum system. It is a unitary operator, and this property is why quantum gates 
are reversible. The intrinsically unitary nature of quantum systems has important 
implications for learning algorithms using quantum hardware. We often denote 
$U(H, t)$ by the single letter U if the Hamiltonian is understood or is not important, 
and the time dependency is left implicit. 

The evolution in the density matrix representation reads 

$$\rho \mapsto U\rho U^\dagger. \tag{3.42}$$

U is a linear operator, so it acts independently on each term of a superposition. The 
state is a superposition of the eigenstates of the Hamiltonian, and thus its time 
evolution is given by 

$$|\psi(t)\rangle = \sum_a c_a e^{-iE_a t/\hbar}|\psi_a(0)\rangle. \tag{3.43}$$

The time evolution operator, being unitary, preserves the $\ell_2$ norm of the state; that 
is, the squared moduli of the probability amplitudes sum to 1 at every time step. This 
result means even more: U does not change the probabilities of the eigenstates, but only 
changes the phases. 
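
Both statements can be checked on a small example. The sketch below assumes units in which $\hbar = 1$ and an arbitrary 2 × 2 Hermitian matrix as the Hamiltonian; `evolve` is a hypothetical helper that applies the time evolution operator through the spectral decomposition.

```python
import numpy as np

def evolve(H, psi0, t, hbar=1.0):
    """Apply U = exp(-i H t / hbar) via the spectral decomposition of H."""
    energies, eigvecs = np.linalg.eigh(H)
    coeffs = eigvecs.conj().T @ psi0                 # c_a = <psi_a|psi(0)>
    phases = np.exp(-1j * energies * t / hbar)       # each eigenstate only acquires a phase
    return eigvecs @ (phases * coeffs)

H = np.array([[1.0, 0.5], [0.5, -1.0]])              # illustrative Hermitian matrix
psi0 = np.array([1.0, 0.0], dtype=complex)
psi_t = evolve(H, psi0, t=2.0)

# Probabilities of the energy eigenstates before and after the evolution
_, eigvecs = np.linalg.eigh(H)
probs_0 = np.abs(eigvecs.conj().T @ psi0) ** 2
probs_t = np.abs(eigvecs.conj().T @ psi_t) ** 2
```

The norm of the evolved state remains 1, and the eigenstate probabilities are unchanged; only the relative phases differ.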

The matrix form of U depends on the basis. If we take any orthonormal basis, 
elements of the time evolution matrix acquire a clear physical meaning as the 
transition amplitudes between the corresponding eigenstates of this basis (Fayngold 
and Fayngold, 2013, p. 297). The transition amplitudes are generally time-dependent. 

The unitary evolution reveals insights into the nomenclature “probability ampli- 
tudes.” The norm of the state vector is 1, and the components of the norm are 
constant. The probability amplitudes, however, oscillate between time steps: their 
phase changes. 

A second look at Equation 3.41 reveals that an eigenvector of the Hamiltonian is an 
eigenvector of the time evolution operator. The eigenvalue is a complex exponential, 
which means U is not Hermitian. 


3.5 Measurement 


The state vector evolves deterministically as the continuous solution of the wave 
equation. All the while, the state vector is in a superposition of component states. 
What happens to a superposition when we perform a measurement on the system? 

Before we can attempt to answer that question, we must pay attention to an equally 
important one: What is being measured? It is the probability amplitude that evolves in 
a deterministic manner, and not a measurable characteristic of the system (Fayngold 
and Fayngold, 2013, p. 558). 

An observable quantity, such as the energy or momentum of a particle, is associated 
with a mathematical operator, the observable. The observable is a Hermitian operator 
M acting on the state space, with spectral decomposition 

$$M = \sum_i \alpha_i P_i, \tag{3.44}$$

where $P_i$ is a projector onto the eigenspace of the operator M with eigenvalue $\alpha_i$. In 
other words, an observable is a weighted sum of projectors. The possible outcomes 
of the measurement correspond to the eigenvalues $\alpha_i$. Since M is Hermitian, the 
eigenvalues are real. 

The projectors are idempotent by definition and map to the eigenspaces of the 
operator; they are mutually orthogonal, and their sum is the identity: 

$$P_i P_j = \delta_{ij} P_i, \tag{3.45}$$

$$\sum_i P_i = I. \tag{3.46}$$

Returning to the original question, we find a discontinuity occurs when a measurement 
is performed. All components of the superposition vanish, except one. We 
observe an eigenvalue $\alpha_i$ with the following probability: 

$$P(\alpha_i) = \langle\psi|P_i|\psi\rangle. \tag{3.47}$$

Thus, the outcome of a measurement is inherently probabilistic. This formula is 
also called Born’s rule. The system will be in the following state immediately after 
measurement: 

$$\frac{P_i|\psi\rangle}{\sqrt{\langle\psi|P_i|\psi\rangle}}. \tag{3.48}$$

This procedure is called a projective or von Neumann measurement. The measure- 
ment is irreversible and causes loss of information, as we cannot learn more about the 
superposition before the measurement. A simple explanation for the phenomenon is 
that the discontinuity arises from the interaction of a classical measuring instrument 
and the quantum system. More elaborate explanations abound, but they are not 
relevant for the rest of the discussion. 
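
A projective measurement can be simulated by sampling an outcome according to Born's rule and renormalizing the projected state. The sketch below assumes a measurement in the computational basis of a single qubit; `measure` is a hypothetical helper, and the random seed is arbitrary.

```python
import numpy as np

def measure(psi, projectors, rng):
    """Sample one outcome of a projective measurement and return the collapsed state."""
    # Born's rule: P(alpha_i) = <psi|P_i|psi>
    probs = np.array([np.vdot(psi, P @ psi).real for P in projectors])
    outcome = rng.choice(len(projectors), p=probs)
    # Project onto the observed eigenspace and renormalize
    post = projectors[outcome] @ psi
    return outcome, post / np.sqrt(probs[outcome]), probs

psi = np.array([np.sqrt(0.2), np.sqrt(0.8)], dtype=complex)
projectors = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]   # computational basis
outcome, post_state, probs = measure(psi, projectors, np.random.default_rng(1))
```

Whatever the sampled outcome, the post-measurement state is the corresponding basis vector, and the original amplitudes can no longer be recovered from it.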

The loss of information from a quantum system is also called decoherence. As 
the quantum system interacts with its environment— for instance, with the measuring 
instrument—components of the state vector are decoupled from a coherent system, 
and entangle with the surroundings. A global state vector of the system and the envi- 
ronment remains coherent: it is only the system we are observing that loses coherence. 
Hence, decoherence does not explain the discontinuity of the measurement, it only 
explains why an observer no longer sees the superposition. Furthermore, decoherence 
occurs spontaneously between the environment and the quantum system even if we do 
not perform a measurement. This makes the realization of quantum computing a tough 
challenge, as a quantum computer relies on the undisturbed evolution of quantum 
superpositions. 

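Measurements with the density matrix representation mirror the projective measurement 
scheme. The probability of obtaining an output $\alpha_i$ is 

$$P(\alpha_i) = \langle\psi|P_i|\psi\rangle = \langle\psi|P_i P_i|\psi\rangle \tag{3.49}$$

$$= \langle\psi|P_i^\dagger P_i|\psi\rangle = \mathrm{tr}(P_i^\dagger P_i |\psi\rangle\langle\psi|) = \mathrm{tr}(P_i^\dagger P_i \rho). \tag{3.50}$$

The density matrix after measurement must be renormalized: 

$$\frac{P_i \rho P_i^\dagger}{\mathrm{tr}(P_i^\dagger P_i \rho)}. \tag{3.51}$$
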


Projective measurements are restricted to orthogonal states: they project on an 
eigenspace. We may want to observe outputs that do not belong to orthogonal states. 
Positive operator—valued measures (POVMs) overcome this limitation. 

A POVM is a set of positive Hermitian operators $\{P_i\}$ that satisfies the completeness 
relation: 

$$\sum_i P_i = I. \tag{3.52}$$


That is, the orthogonality and idempotency constraints of projective measurements are 
relaxed. Thus, a POVM represents a nonoptimal measurement that is not designed to 
return an eigenstate of the system. Instead, a POVM measures along unit vectors that 
are not orthogonal. 

The probability of obtaining an output $\alpha_i$ is given by a formula similar to that for 
projective measurements: 

$$P(\alpha_i) = \langle\psi|P_i|\psi\rangle. \tag{3.53}$$

Similarly, the system will have to be renormalized after measurement: 

$$\frac{P_i|\psi\rangle}{\sqrt{\langle\psi|P_i|\psi\rangle}}. \tag{3.54}$$


We may reduce a POVM to a projective measurement on a larger Hilbert space. 
We couple the original system with another system called the ancilla (Fayngold and 
Fayngold, 2013, p. 660). We let the joint system evolve until the nonorthogonal unit 
vectors corresponding to outputs become orthogonal. In this larger Hilbert space, 
the POVM reduces to a projective measurement. This is a common pattern in many 
applications of quantum information theory: ancilla systems make understanding or 
implementing a specific target easier. 
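
As a concrete instance of a POVM with nonorthogonal elements, the sketch below builds three rank-one operators from unit vectors spaced 120 degrees apart on the Bloch sphere. This particular construction is an assumption chosen for illustration; it is not taken from the text.

```python
import numpy as np

# Three nonorthogonal unit vectors, spaced 120 degrees apart on the Bloch sphere
angles = [0.0, 2 * np.pi / 3, 4 * np.pi / 3]
vectors = [np.array([np.cos(t / 2), np.sin(t / 2)]) for t in angles]

# Rescaled rank-one operators: positive, but neither orthogonal nor idempotent
povm = [(2.0 / 3.0) * np.outer(v, v) for v in vectors]

psi = np.array([1.0, 0.0])
probabilities = np.array([psi @ P @ psi for P in povm])
```

The three operators sum to the identity, so the outcome probabilities for any state add up to 1 even though the elements are not projectors.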


3.6 Uncertainty Relations 


If two observables do not commute, a state cannot be a simultaneous eigenvector 
of both in general (Cohen-Tannoudji et al., 1996, p. 233). This leads to a form of 
the uncertainty relation similar to the one found by Heisenberg in his analysis of 
sequential measurements of position and momentum. This original relation states that 
there is a fundamental limit to the precision with which the position and momentum 
of a particle can be known. 

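The expectation value of an observable A, a Hermitian operator, is $\langle A\rangle = 
\langle\psi|A|\psi\rangle$. Its standard deviation is $\sigma_A = \sqrt{\langle A^2\rangle - \langle A\rangle^2}$. In the most general form, the 
uncertainty principle is given by 

$$\sigma_A \sigma_B \ge \frac{1}{2} |\langle [A, B] \rangle|. \tag{3.55}$$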


This relation clearly shows that uncertainty emerges from the noncommutativity of 
the operators. It implies that the observables are incompatible in a physical setting. 
The incompatibility is unrelated to subsequent measurements in a single experiment. 
Rather, it means that preparing identical states, $|\psi\rangle$, we measure one observable in one 
subset, and the other observable in the other subset. In this case, the standard deviation 
of the measurements will satisfy the inequality in Equation 3.55. 
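
The inequality can be verified numerically for two noncommuting observables. The sketch below assumes the Pauli X and Y matrices as observables and an arbitrary normalized single-qubit state; the helper names are illustrative.

```python
import numpy as np

def expectation(op, psi):
    """Expectation value <psi|op|psi>."""
    return np.vdot(psi, op @ psi)

def std_dev(op, psi):
    """Standard deviation sqrt(<op^2> - <op>^2) of a Hermitian operator."""
    variance = expectation(op @ op, psi).real - expectation(op, psi).real ** 2
    return np.sqrt(max(variance, 0.0))

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)

psi = np.array([np.cos(0.3), np.exp(0.7j) * np.sin(0.3)])   # arbitrary normalized state
lhs = std_dev(X, psi) * std_dev(Y, psi)
rhs = 0.5 * abs(expectation(X @ Y - Y @ X, psi))
```

The product of standard deviations `lhs` is never smaller than `rhs`, whichever normalized state is chosen.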

As long as two operators do not commute, they will be subjected to a corre- 
sponding uncertainty principle. This attracted attention from other communities who 
apply quantum-like observables to describe phenomena, for instance, in cognitive 
science (Pothos and Busemeyer, 2013). 

Interestingly, the uncertainty principle implies nonlocality (Oppenheim and 
Wehner, 2010). The uncertainty principle is a restriction on measurements made 
on a single system, and nonlocality is a restriction on measurements conducted on 
two systems. Yet, by treating both nonlocality and uncertainty as a coding problem, 
we find these restrictions are related. 


3.7 Tunneling 


Quantum tunneling is a phenomenon that has no classical counterpart, but that has 
proved to be tremendously successful in applications. In a machine learning context, 
it plays a part in quantum annealing (Section 14.1). 

In a classical system, assume that a moving object has a mechanical energy E, and 
it reaches a barrier with potential energy U(x). If E < U(x), the object will not be able 
to surmount the barrier. In quantum systems, there is such a possibility. 

A quantum particle may appear on either side of the barrier; after all, its location 
is probabilistic. The introduction of a potential barrier merely changes the distribution. 
A particle with definite $E < U(x)$, and even $E \ll U(x)$, can pass through the barrier, 
almost as if there were a “tunnel” cut across. As the potential barrier increases, the 
probability of tunneling decreases: taller and wider barriers will see fewer particles 
tunneling through. 

Let us denote the two turning points on the two sides of the barrier by $x_1$ and $x_2$ 
(Figure 3.1). The region $x_1 < x < x_2$ is classically inaccessible. From the position 
indeterminacy $\Delta x \le |x_2 - x_1|$, it follows that, as $\Delta p \ge \hbar/2\Delta x$, the particle has an 
indeterminacy in its energy: $E \pm \frac{1}{2}\Delta E$. Thus, the particle’s energy may exceed the 
barrier, and the particle passes over the barrier. 

To understand why tunneling occurs, the wave nature of particles helps. The 
process resembles the total internal reflection of a wave in classical physics (Fayngold 
and Fayngold, 2013, p. 241). The total reflection does not happen exactly at the 
interface between the two media: it is preceded by partial penetration of the wave 
into the second layer. If the second layer has finite width, the penetrated part of the 
wave partially leaks out on the other side of the layer, leading to frustrated internal 
reflection, and causing transmission. 


Figure 3.1 Quantum tunneling through a potential barrier $U(x)$; the region
$x_1 < x < x_2$ is classically inaccessible. The particle that tunnels through has a
decreased amplitude, but the same energy.


3.8 Adiabatic Theorem


The adiabatic theorem, originally proved in Born and Fock (1928), has important
applications in quantum computing (Section 4.3 and Chapter 14). An adiabatic process
changes conditions gradually so as to allow the system to adapt its configuration. If the
system starts in the ground state of an initial Hamiltonian, after the adiabatic change,
it will end in the ground state of the final Hamiltonian.

Consider a Hamiltonian $H_0$ whose unique ground state can be easily constructed,
and another Hamiltonian $H_1$ whose ground-state energy we are interested in. To find
this ground state, one considers the Hamiltonian


$$H(\lambda) = (1 - \lambda)H_0 + \lambda H_1, \tag{3.56}$$


with $\lambda = \lambda(t)$ as a time-dependent parameter with a range in $[0,1]$. The quantum
adiabatic theorem states that if we start in the ground state of $H(0) = H_0$ and we
slowly increase the parameter $\lambda$, then the system will evolve to the ground state of $H_1$
if $H(\lambda)$ is always gapped, that is, there is no degeneracy for the ground-state energy
(Figure 3.2). The adiabatic theorem also states that the correctness of the change to
the system depends critically on the time $t_1 - t_0$ during which the change takes place.
This should be large enough, and it depends on the minimum gap between the ground
state and the first excited state throughout the adiabatic process. While this is intuitive,
a gapless version of the theorem also exists, in which case the duration must be estimated
in some other way (Avron and Elgart, 1999).
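The role of the minimum gap can be illustrated numerically. The following is a minimal sketch, not taken from the text: it assumes a single qubit with the hypothetical choices $H_0 = -\sigma_x$ and $H_1 = -\sigma_z$, interpolates between them as in Equation 3.56, and records the gap between the two lowest eigenvalues along the path.

```python
import numpy as np

# Illustrative single-qubit Hamiltonians (an assumption, not from the text):
# H0 = -sigma_x has ground state (|0> + |1>)/sqrt(2); H1 = -sigma_z has ground state |0>.
sigma_x = np.array([[0.0, 1.0], [1.0, 0.0]])
sigma_z = np.array([[1.0, 0.0], [0.0, -1.0]])
H0, H1 = -sigma_x, -sigma_z

def interpolated_hamiltonian(lam):
    """H(lambda) = (1 - lambda) H0 + lambda H1, as in Equation 3.56."""
    return (1.0 - lam) * H0 + lam * H1

def spectral_gap(lam):
    """Difference between the two lowest eigenvalues of H(lambda)."""
    energies = np.linalg.eigvalsh(interpolated_hamiltonian(lam))  # ascending order
    return energies[1] - energies[0]

lambdas = np.linspace(0.0, 1.0, 101)
gaps = np.array([spectral_gap(lam) for lam in lambdas])
min_gap = gaps.min()                   # smallest gap along the path
lam_at_min = lambdas[gaps.argmin()]    # where the evolution must be slowest
```

For this toy pair the gap never closes, so the condition of the theorem holds along the whole path; the smaller the minimum gap, the longer the change must take.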


Figure 3.2 Using the adiabatic theorem, we reach the ground-state energy of a simple
Hamiltonian $H_0$, then we gradually change the Hamiltonian until we reach the target $H_1$. If
enough time is allowed for the change, the system remains in the ground state throughout the
process.


3.9 No-Cloning Theorem


Cloning of a pure state $|\psi\rangle$ means a procedure with a separable state as an output:
$|\psi\rangle \otimes |\psi\rangle$.

We start by adding an ancilla system with the same state space, but the initial state
is unrelated to the state being cloned: $|\psi\rangle \otimes |0\rangle$. Cloning thus means finding a unitary
operator $U$ such that it evolves this initial state to the desired output:


$$U(|\psi\rangle \otimes |0\rangle) = |\psi\rangle \otimes |\psi\rangle. \tag{3.57}$$


Since cloning should work for any state in the state space, $U$ has a similar effect on
a different state $|\phi\rangle$. Consider the inner product of $|\psi\rangle$ and $|\phi\rangle$ together with the ancilla in
the product space:


$$\langle 0|\langle\phi|\,|\psi\rangle|0\rangle = \langle 0|\langle\phi|U^\dagger U|\psi\rangle|0\rangle = \langle\phi|\langle\phi|\,|\psi\rangle|\psi\rangle. \tag{3.58}$$


The left-hand side equals $\langle\phi|\psi\rangle$, because $\langle 0|0\rangle = 1$, and the right-hand side equals
$\langle\phi|\psi\rangle^2$. Thus, $\langle\phi|\psi\rangle = \langle\phi|\psi\rangle^2$. Since states are unit vectors, this implies that either
$\langle\phi|\psi\rangle = 1$ or $\langle\phi|\psi\rangle = 0$. Yet, we required the two vectors to be arbitrary in the state space, so
cloning is feasible only for specific cases. A unitary operator cannot clone a general
quantum state.

The principle extends to mixed states, where it is known as the no-broadcast
theorem.

The no-cloning theorem has important consequences for quantum computing: exact
copying of data is not possible. Hence, the basic units of classical computing, such as
a random access memory, require alternative concepts.
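The no-cloning argument can be checked on a small example. The sketch below is an illustration under stated assumptions rather than part of the proof: it takes the controlled-NOT gate (introduced in Section 4.2) as a hypothetical candidate cloner and tests whether it maps $|\psi\rangle \otimes |0\rangle$ to $|\psi\rangle \otimes |\psi\rangle$. It succeeds on the two orthogonal basis states and fails on their superposition, as the inner-product condition requires.

```python
import numpy as np

# CNOT in the basis |00>, |01>, |10>, |11> (first qubit is the control).
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

zero = np.array([1.0, 0.0])
one = np.array([0.0, 1.0])
plus = (zero + one) / np.sqrt(2)

def try_to_clone(state):
    """Apply the candidate cloner to |state>|0> and return the output state."""
    return CNOT @ np.kron(state, zero)

def is_clone(state):
    """True if the output equals the product state |state>|state>."""
    return bool(np.allclose(try_to_clone(state), np.kron(state, state)))

clones_zero = is_clone(zero)   # basis state: copied correctly
clones_one = is_clone(one)     # basis state: copied correctly
clones_plus = is_clone(plus)   # superposition: output is entangled, not a copy
overlap = zero @ plus          # <0|+>, neither 0 nor 1
```

The overlap between $|0\rangle$ and the superposition is neither 0 nor 1, which is exactly the case excluded by the derivation above.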
Quantum Computing


A quantum system makes a surprisingly efficient computer: a quantum algorithm 
may provide quadratic or even exponential speedup over the best known classical 
counterparts. Since learning problems involve tedious and computationally intensive 
calculations, this alone is a good reason to look at how quantum computers accelerate 
optimization. 

This chapter looks at the most important concepts, and we restrict ourselves to 
theoretical constructs: we are not concerned with the physical implementation of 
quantum computers. 

A qubit is the fundamental building block of quantum computing, and is a two- 
level quantum state (Section 4.1). By defining operations on single or multiple qubits, 
we can construct quantum circuits, which is one model of quantum computing 
(Section 4.2). An alternative model is adiabatic quantum computing, which manip- 
ulates states at a lower level of abstraction, using Hamiltonians (Section 4.3). 

Irrespective of the computational model, one of the advantages of quantum 
computing is quantum parallelism—an application of superposition (Section 4.4). 
Learning algorithms rely on this parallelism for a speedup either directly or through 
Grover’s search (Section 4.5). We highlight that the parallelism provided by quantum 
computers does not extend the range of calculable problems—some quintessential 
limits from classical computing still apply (Section 4.6). 

We close this chapter by mentioning a few concepts of information theory 
(Section 4.7). The fidelity of states will be particularly useful, as it allows the definition 
of distance functions that are often used in machine learning. 


4.1 Qubits and the Bloch Sphere 


Any two-level quantum system can form a qubit—for example, polarized photons, 
spin-1/2 particles, excited atoms, and atoms in ground state. 

A convenient choice of basis is $\{|0\rangle, |1\rangle\}$; this is called the computational basis.
The general pure state of a qubit in this basis is $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$. Since
$|\alpha|^2 + |\beta|^2 = 1$, we write


$$|\psi\rangle = e^{i\gamma}\left(\cos\frac{\theta}{2}|0\rangle + e^{i\phi}\sin\frac{\theta}{2}|1\rangle\right), \tag{4.1}$$


with $0 \leq \theta \leq \pi$, $0 \leq \phi \leq 2\pi$. The global phase factor $e^{i\gamma}$ has no observable effects;
hence, the formula reduces to


$$|\psi\rangle = \cos\frac{\theta}{2}|0\rangle + e^{i\phi}\sin\frac{\theta}{2}|1\rangle. \tag{4.2}$$


With the constraints on $\theta$ and $\phi$, these two numbers define a point on the surface of
the unit sphere in three dimensions. This sphere is called the Bloch sphere. Its purpose
is to give a geometric explanation to single-qubit operations (Figure 4.1).

The Pauli matrices are a set of three 2 x 2 complex matrices which are Hermitian 
and unitary. They are 


$$\sigma_x = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad \sigma_y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \quad \sigma_z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}. \tag{4.3}$$


Together with the identity matrix $I$, they form a basis for the real Hilbert space of
$2 \times 2$ complex Hermitian matrices. Each Pauli matrix is related to an operator that
corresponds to an observable describing the spin of a spin-1/2 particle, in each of the
corresponding three spatial directions.

In the density matrix representation, for pure states, we have 


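$$\rho = |\psi\rangle\langle\psi| = \begin{pmatrix} \cos^2\frac{\theta}{2} & \frac{1}{2}e^{-i\phi}\sin\theta \\ \frac{1}{2}e^{i\phi}\sin\theta & \sin^2\frac{\theta}{2} \end{pmatrix}. \tag{4.4}$$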


Since any $2 \times 2$ complex Hermitian matrix can be expressed in terms of the identity
matrix and the Pauli matrices, $2 \times 2$ mixed states (that is, $2 \times 2$ positive semidefinite
matrices with trace 1) can be represented by the Bloch sphere. We write a Hermitian
matrix as a real linear combination of $\{I, \sigma_x, \sigma_y, \sigma_z\}$, then we impose the positive
semidefinite and trace 1 assumptions. Thus, a density matrix is written as
$\rho = \frac{1}{2}(I + s \cdot \sigma)$, where $\sigma$ is a vector of the Pauli matrices, and $s$ is called the Bloch vector.
For pure states, this provides a one-to-one mapping to the surface of the Bloch sphere.
Geometrically,


$$s = \begin{pmatrix} \sin\theta\cos\phi \\ \sin\theta\sin\phi \\ \cos\theta \end{pmatrix}, \tag{4.5}$$


where $\|s\| = 1$. For mixed states, the Bloch vector lies in the interior of the Bloch ball.


Figure 4.1 The Bloch sphere.


With this interpretation, the computational basis corresponds to the Z axis of the
Bloch sphere, and to the eigenvectors of the Pauli matrix $\sigma_z$. We call the X axis of
the Bloch sphere the diagonal basis, which corresponds to the eigenvectors of the
Pauli matrix $\sigma_x$. The Y axis gives the circular basis, with the eigenvectors of the Pauli
matrix $\sigma_y$. Pauli matrices essentially generate rotations around the corresponding axes: for
instance, about the X axis, we have $R_x = e^{i\theta\sigma_x/2}$.
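The correspondence between states and Bloch vectors can be verified numerically. The sketch below is an illustration with arbitrarily chosen angles: it builds a pure state from $\theta$ and $\phi$, extracts the components of the Bloch vector as $s_k = \mathrm{tr}(\rho\sigma_k)$, which follows from writing the density matrix in the Pauli basis, and compares the result with the geometric formula. The function names are hypothetical.

```python
import numpy as np

sigma_x = np.array([[0, 1], [1, 0]], dtype=complex)
sigma_y = np.array([[0, -1j], [1j, 0]], dtype=complex)
sigma_z = np.array([[1, 0], [0, -1]], dtype=complex)

def state_from_angles(theta, phi):
    """Pure qubit state cos(theta/2)|0> + exp(i phi) sin(theta/2)|1>."""
    return np.array([np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)])

def bloch_vector(rho):
    """Components s_k = tr(rho sigma_k) of the Bloch vector of a density matrix."""
    return np.real([np.trace(rho @ s) for s in (sigma_x, sigma_y, sigma_z)])

theta, phi = 1.1, 0.7                      # arbitrary illustrative angles
psi = state_from_angles(theta, phi)
rho = np.outer(psi, psi.conj())            # density matrix of the pure state
s = bloch_vector(rho)
expected = np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

# The maximally mixed state I/2 sits at the centre of the Bloch ball.
s_mixed = bloch_vector(np.eye(2) / 2)
```

The pure state yields a unit-length vector, whereas the maximally mixed state maps to the origin.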

Mixed states are in a way closer to classical states. The classical states are along 
the Z axis (the identity matrix is the origin). Quantum behavior emerges as we leave 
this axis. The two poles are of great interest: this is where pure states intersect with 
classical behavior. 

The Bloch sphere stores an infinite amount of information, but neighboring points 
on the Bloch sphere cannot be distinguished reliably. Hence, we must construct states 
in a way to be able to tell them apart. This puts constraints on the storage capacity. 

Multiple qubits are easy to construct as product spaces of individual qubits. The
bases generalize from the one-qubit case. For instance, the computational basis of a
two-qubit system is $\{|00\rangle, |01\rangle, |10\rangle, |11\rangle\}$. A generic state is in the superposition of
these basis vectors:


$$|\psi\rangle = \alpha_{00}|00\rangle + \alpha_{01}|01\rangle + \alpha_{10}|10\rangle + \alpha_{11}|11\rangle, \tag{4.6}$$


where $|\alpha_{00}|^2 + |\alpha_{01}|^2 + |\alpha_{10}|^2 + |\alpha_{11}|^2 = 1$. Unfortunately, the geometric insights
provided by the Bloch sphere for single qubits do not generalize to multiple qubits.

Entangled qubits may provide an orthonormal basis for the space of multiple qubits.
For instance, the Bell states are maximally entangled, and they provide such a basis:


$$|\phi^+\rangle = \frac{1}{\sqrt{2}}(|00\rangle + |11\rangle), \tag{4.7}$$

$$|\phi^-\rangle = \frac{1}{\sqrt{2}}(|00\rangle - |11\rangle), \tag{4.8}$$

$$|\psi^+\rangle = \frac{1}{\sqrt{2}}(|01\rangle + |10\rangle), \tag{4.9}$$

$$|\psi^-\rangle = \frac{1}{\sqrt{2}}(|01\rangle - |10\rangle). \tag{4.10}$$
As is already apparent from the two-qubit case, the number of probability amplitudes
is exponential in the number of qubits: an $n$-qubit system will have $2^n$ probability
amplitudes. Thus, it is tempting to think that qubits encode exponentially more
information than their classical counterparts. This is not the case: the measurement
puts a limit on the maximum number of bits represented. This is known as Holevo’s
bound: $n$ qubits encode at most $n$ bits. Quantum information does not compress
classical information.


4.2 Quantum Circuits 


Quantum computing has several operational models, but only two are crucial for later 
chapters: quantum circuits and adiabatic quantum computing. Other models include 
topological quantum computing and one-way quantum computing. We begin with a 
discussion of quantum circuits. 

Quantum circuits are the most straightforward analogue of classical computers: 
wires connect gates which manipulate qubits. The transformation made by the gates 
is always reversible: this is a remarkable departure from classical computers. 

The circuit description uses simple diagrams to represent connections. Single wires 
stand for quantum states traveling between operations. Whether it is an actual wire, 
an optical channel or any other form of transmission is not important. Classical 
bits travel on double lines. Boxes represent gates and other unitary operations on 
one or multiple qubits. Since the transformation is always reversible, the number 
of input and output qubits is identical. Measurement is indicated by a box with a 
symbolic measurement device inside. Measurement in a quantum circuit is always 
understood to be a projective measurement, as ancilla systems can be introduced 
(Section 3.5). Measurement is performed in the computational basis, unless otherwise 
noted. 

The most elementary single-qubit operation is a quantum NOT gate, which takes
$|0\rangle$ to $|1\rangle$, and vice versa. A generic quantum state in the computational basis will
change from $\alpha|0\rangle + \beta|1\rangle$ to $\alpha|1\rangle + \beta|0\rangle$.

Given a basis (usually the computational basis), we can represent quantum gates
in a matrix form. If we denote the quantum NOT gate by the matrix $X$, it is defined as


$$X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}. \tag{4.11}$$


Coincidentally, the quantum NOT gate is identical with the Pauli $\sigma_x$ matrix
(Equation 4.3).

Since states transform unitarily, such matrix representations for quantum gates 
are always unitary. The converse is also true: any unitary operation is a valid time 
evolution of the quantum state; hence, it defines a valid quantum gate. Whereas the 
only single-bit operation in classical systems is the NOT gate, single-qubit gates are 
in abundance. Not all of them are equally interesting, however. 
Figure 4.2 A Hadamard gate.


Two more gates, the Hadamard gate and the Z gate, are of special importance. The
Hadamard gate is defined as


$$H = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}. \tag{4.12}$$


It is its own inverse: $H^2 = I$. It transforms elements of the computational basis
to “halfway” between the elements. That is, it takes $|0\rangle$ into $(|0\rangle + |1\rangle)/\sqrt{2}$ and $|1\rangle$
into $(|0\rangle - |1\rangle)/\sqrt{2}$. In other words, sending through an element of the computational
basis will put it in an equal superposition. Measuring the resulting state gives 0 or
1 with 50% probability each. The symbol for a Hadamard gate in a circuit is shown in
Figure 4.2. Most gates follow this scheme: a box containing an acronym for the gate
represents it in a quantum circuit.

The Z gate is defined as


$$Z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}. \tag{4.13}$$


It leaves $|0\rangle$ invariant, and changes the sign of $|1\rangle$. This is essentially a phase shift.

While there are infinitely many single-qubit operations, they can be approximated 
to arbitrary precision with only a finite set of these gates. 
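The single-qubit gates discussed so far are small enough to check by direct matrix multiplication. The following sketch, with variable names chosen for illustration, applies the NOT, Hadamard, and Z gates to the computational basis and confirms that two consecutive Hadamard gates give the identity.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=float)               # quantum NOT gate
Z = np.array([[1, 0], [0, -1]], dtype=float)              # Z gate
H = np.array([[1, 1], [1, -1]], dtype=float) / np.sqrt(2)  # Hadamard gate

ket0 = np.array([1.0, 0.0])
ket1 = np.array([0.0, 1.0])

plus = H @ ket0                      # (|0> + |1>)/sqrt(2)
minus = H @ ket1                     # (|0> - |1>)/sqrt(2)
probabilities = np.abs(plus) ** 2    # measurement statistics in the computational basis
h_squared = H @ H                    # applying the Hadamard gate twice
hzh = H @ Z @ H                      # a Z gate conjugated by Hadamard gates
```

The last product equals the NOT gate: a phase flip in the computational basis acts as a bit flip in the diagonal basis.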

Moving on to multiple-qubit operations, we discuss the controlled-NOT (CNOT)
gate next. The CNOT gate takes two input qubits: one is called the control qubit,
the other is the target qubit. Depending on the value of the control qubit, a NOT
operation may be applied on the target qubit. The control qubit will remain unchanged
irrespective of the value of either input qubit.

The matrix representation of a CNOT gate is a $4 \times 4$ matrix to reflect the impact
on both qubits. It is defined as


$$\mathrm{CNOT} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}. \tag{4.14}$$


The CNOT gate is a generalized XOR gate: its action on a bipartite state $|A, B\rangle$
is $|A, B \oplus A\rangle$, where $\oplus$ is addition modulo 2, that is, a XOR operation. The
representation of a CNOT gate in a quantum circuit is shown in Figure 4.3.

The composite system also obeys unitary evolution; hence, every operation on multiple
qubits is described by a unitary matrix. This unitary nature implies reversibility:
it establishes a bijective mapping between input and output bits. Given the output and
the operations, we can always recover the initial state. This is generally not true for
classical gates. For instance, a classical XOR gate is a surjection: given an output,
there are two possible input configurations. The implications of this for learning
algorithms are far-reaching, as they restrict the family of functions that can be learned
directly, and leave nonlinearity or discontinuous manipulations to a measurement at
the end of the unitary transformations. However, we must point out that arbitrary
classical circuits can be made reversible by introducing ancilla variables.

Figure 4.3 A CNOT gate.

Figure 4.4 The quantum swap operation. (a) A sequence of three CNOT gates results in a
swap operation. (b) The swap operation in a circuit.

Figure 4.5 A circuit to generate Bell states from the computational basis.

If a logic gate is irreversible, information is erased: some of the information
is lost as an operation is performed. In a reversible scenario, no information is
lost. Landauer’s principle establishes a relationship between the loss of information
and energy: there is a minimum amount of energy required to erase one bit of
information, $kT \log 2$, where $k$ is the Boltzmann constant, and $T$ is the temperature
of the environment. If a bit is lost, at least this much energy is dissipated into
the environment. Since quantum computers use reversible gates only, theoretically
quantum computing could be performed without expending energy.

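Three CNOT gates combine to create a swap operation (Figure 4.4). Given a state
$|A, B\rangle$ in the computational basis, the gate operates as


$$|A, B\rangle \mapsto |A, A \oplus B\rangle \mapsto |A \oplus (A \oplus B), A \oplus B\rangle = |B, A \oplus B\rangle \tag{4.15}$$

$$\mapsto |B, (A \oplus B) \oplus B\rangle = |B, A\rangle. \tag{4.16}$$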


A CNOT gate combined with a Hadamard gate generates the Bell states, starting 
with the computational basis. The computational basis is in a tensor product space, 
whereas the Bell states are maximally entangled: this gate combination allows us to 
create entangled states (Figure 4.5). 
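Both constructions can be reproduced with explicit matrices. The sketch below is a minimal illustration: the helper functions are hypothetical, and the basis ordering $|00\rangle, |01\rangle, |10\rangle, |11\rangle$ with the first qubit as the more significant one is an assumed convention. It composes three alternating CNOT gates into a swap, and it applies a Hadamard gate followed by a CNOT gate to $|00\rangle$.

```python
import numpy as np
from itertools import product

I2 = np.eye(2)
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

def basis_state(a, b):
    """Vector for |a, b> in the ordering |00>, |01>, |10>, |11>."""
    vec = np.zeros(4)
    vec[2 * a + b] = 1.0
    return vec

def controlled_not(control):
    """CNOT on two qubits; `control` is 0 (first qubit) or 1 (second qubit)."""
    gate = np.zeros((4, 4))
    for a, b in product((0, 1), repeat=2):
        out = (a, b ^ a) if control == 0 else (a ^ b, b)
        gate[:, 2 * a + b] = basis_state(*out)   # column = image of |a, b>
    return gate

CNOT_01 = controlled_not(0)   # control: first qubit, target: second qubit
CNOT_10 = controlled_not(1)   # control: second qubit, target: first qubit

# Three alternating CNOT gates swap the two qubits.
SWAP = CNOT_01 @ CNOT_10 @ CNOT_01

# Hadamard on the first qubit, then CNOT: |00> -> (|00> + |11>)/sqrt(2).
bell_circuit = CNOT_01 @ np.kron(H, I2)
phi_plus = bell_circuit @ basis_state(0, 0)
```

Feeding the other three computational basis states into the same circuit yields the remaining Bell states.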

The Toffoli gate operates on three qubits, and it has special relevance to classical
computations. It has the effect $|A, B, C\rangle \mapsto |A, B, C \oplus AB\rangle$; that is, it flips the third
qubit if the first two control qubits are set to 1. In other words, it generalizes the
CNOT gate further, which itself is a generalized XOR operation. We denote the Toffoli
gate by 2XOR. The circuit representation of the Toffoli gate is shown in Figure 4.6.
Increasing the number of control qubits, we can obtain generic $n$XOR gates.

Figure 4.6 A Toffoli gate.

Figure 4.7 A Fredkin gate.

The Toffoli gate is its own inverse: applying it twice gives the identity. The classical
counterpart is universal: any classical circuit can be simulated by Toffoli gates; hence,
every classical circuit has a reversible equivalent. The quantum Toffoli gate can simulate
irreversible classical circuits; hence, quantum computers are able to perform any operation
that classical computers can.
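The classical universality of the Toffoli gate rests on the fact that it reproduces the NAND gate and can copy a bit when some inputs are fixed. The following sketch works on classical bits only; the function names are hypothetical.

```python
def toffoli(a, b, c):
    """Classical action of the Toffoli gate: (a, b, c) -> (a, b, c XOR (a AND b))."""
    return a, b, c ^ (a & b)

def nand(a, b):
    """NAND from a single Toffoli gate with the target bit initialised to 1."""
    return toffoli(a, b, 1)[2]

def fanout(a):
    """Copy a classical bit: controls (a, 1) and target 0 leave a copy of a in the target."""
    return toffoli(a, 1, 0)[2]

nand_table = {(a, b): nand(a, b) for a in (0, 1) for b in (0, 1)}
bits = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
self_inverse = all(toffoli(*toffoli(*triple)) == triple for triple in bits)
```

Since NAND together with fan-out suffices to build any Boolean circuit, the same holds for Toffoli gates supplied with fixed ancilla bits.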

The next important gate, the Fredkin gate, has three input qubits, but only one is
for control (Figure 4.7). If the control qubit is set to 1, the target qubits are swapped;
otherwise the state is not modified. The Fredkin gate is also its own inverse. The control qubit
in this case is an ancilla: its value is irrelevant to the computation performed. Its
presence ensures that the gate is reversible.

The ancilla qubit may or may not change; it is not important for the result. Let
us assume a function $g$ represents the output in the ancilla; this is a garbage bit, its
value is not important. By adding NOT gates, we can always ensure that the ancilla
qubit starts from the state $|0\rangle$. The generic pattern of computation with an ancilla
is thus


$$|x, 0\rangle \mapsto |f(x), g(x)\rangle. \tag{4.17}$$


By applying a CNOT gate before the calculation, we make a copy of $x$ that
remains unchanged at the end of the transformation:


$$|x, 0, 0\rangle \mapsto |x, f(x), g(x)\rangle. \tag{4.18}$$

Let us add a fourth register $y$, and add the result with a CNOT operation:

$$|x, 0, 0, y\rangle \mapsto |x, f(x), g(x), y \oplus f(x)\rangle. \tag{4.19}$$

Apart from the final CNOT, the calculations did not affect $y$, and they were unitary;
hence, if we apply the reverse operations, we get the following state:


$$|x, 0, 0, y \oplus f(x)\rangle. \tag{4.20}$$


This procedure is known as uncomputation: we uncompute the middle two registers
that are used as scratch pads. We often omit the middle ancilla qubits, and simply
write


$$|x, y\rangle \mapsto |x, y \oplus f(x)\rangle. \tag{4.21}$$


This formula indicates that reversible computing can be performed without production
of garbage qubits.
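The compute, copy, uncompute pattern can be traced on classical bits. The sketch below is a toy model under stated assumptions: `f` and `g` are hypothetical one-bit functions, and the forward computation is modelled as XORing their values into two scratch registers, which makes it its own inverse.

```python
def f(x):
    """Hypothetical one-bit function to evaluate (here: NOT)."""
    return x ^ 1

def g(x):
    """Hypothetical garbage value produced along the way."""
    return x

def forward(x, out, garbage):
    """Toy reversible step: |x, out, garbage> -> |x, out XOR f(x), garbage XOR g(x)>."""
    return x, out ^ f(x), garbage ^ g(x)

# Every step of `forward` is an XOR, so running it a second time undoes it.
reverse = forward

def evaluate(x, y):
    x, out, garbage = forward(x, 0, 0)            # compute into the scratch registers
    y = y ^ out                                   # CNOT: add the result to register y
    x, out, garbage = reverse(x, out, garbage)    # uncompute the scratch registers
    return x, out, garbage, y

results = {(x, y): evaluate(x, y) for x in (0, 1) for y in (0, 1)}
```

In every case the two scratch registers return to 0, and only the last register carries $y \oplus f(x)$.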


4.3 Adiabatic Quantum Computing 


In adiabatic quantum computing, the aim is to find the global minimum of a given
function $f : \{0,1\}^n \rightarrow (0,\infty)$, where $\min_x f(x) = f_0$ and $f(x) = f_0$ if and only if
$x = x_0$; that is, there is a unique minimum (Kraus, 2013). We seek to find $x_0$. To do so,
we consider the Hamiltonian


$$H_1 = \sum_{x \in \{0,1\}^n} f(x)|x\rangle\langle x|, \tag{4.22}$$


whose unique ground state is $|x_0\rangle$. We take an initial Hamiltonian $H_0$ and the
Hamiltonian in Equation 3.56 to adiabatically evolve the system. Thus, if we measure
the system in the computational basis at the end of this process, we obtain $x_0$ (Farhi
et al., 2000).
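A small numerical sketch may help to make the construction concrete. It is an illustration under assumptions that are not in the text: the cost values are hypothetical, and the initial Hamiltonian is taken to be a transverse field, $H_0 = -\sum_i \sigma_x^{(i)}$, a common choice whose ground state is the uniform superposition. The code builds the diagonal problem Hamiltonian of Equation 4.22, scans the interpolation of Equation 3.56, and reads off the minimizer from the ground state of the final Hamiltonian.

```python
import numpy as np

def problem_hamiltonian(costs):
    """Diagonal Hamiltonian sum_x f(x)|x><x| built from the list of values f(x)."""
    return np.diag(np.asarray(costs, dtype=float))

def transverse_field(n_qubits):
    """Assumed initial Hamiltonian: H0 = -sum_i sigma_x acting on qubit i."""
    sigma_x = np.array([[0.0, 1.0], [1.0, 0.0]])
    total = np.zeros((2 ** n_qubits, 2 ** n_qubits))
    for i in range(n_qubits):
        factors = [np.eye(2)] * n_qubits
        factors[i] = sigma_x
        term = factors[0]
        for factor in factors[1:]:
            term = np.kron(term, factor)
        total -= term
    return total

costs = [3.0, 1.0, 4.0, 2.0]     # hypothetical f(x) for x = 00, 01, 10, 11
H0, H1 = transverse_field(2), problem_hamiltonian(costs)

# Gap between the two lowest levels along the interpolation path.
gaps = []
for lam in np.linspace(0.0, 1.0, 201):
    energies = np.linalg.eigvalsh((1 - lam) * H0 + lam * H1)
    gaps.append(energies[1] - energies[0])
min_gap = min(gaps)

final_energies, final_states = np.linalg.eigh(H1)
x0 = int(np.argmax(np.abs(final_states[:, 0])))   # basis index of the ground state of H1
```

Here the ground state of the final Hamiltonian is the basis state with the smallest cost, and the gap stays open along the path, so a sufficiently slow evolution would end in that state.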

The gap between the ground state and the first excited states defines the computa- 
tional complexity: smaller gaps result in longer computational times. The gap depends 
on the initial and the target Hamiltonian. Adiabatic quantum computing speeds up 
finding an optimum by about a quadratic factor over classical algorithms. It is harder to 
argue whether exponential speedups are feasible. Van Dam et al. (2001) have already 
defined lower bounds for optimization problems. 

Adiabatic quantum computing is equivalent to the standard gate model of quantum 
computing (Aharonov et al., 2004), meaning that adiabatic quantum algorithms will 
run on any quantum computer (Kendon et al., 2010). Physical implementation is 
becoming feasible: adiabatic quantum computing has already demonstrated quantum 
annealing with over 100 qubits (Boixo et al., 2014), although the results are dis- 
puted (Rønnow et al., 2014). A further advantage of adiabatic quantum computing is 
that it is more robust against environmental noise and decoherence than other models 
of quantum computing (Amin et al., 2009; Childs et al., 2001). 

When it comes to machine learning, the advantage of adiabatic quantum com- 
puting is that it bypasses programming paradigms: the simulated annealing of the 
Hamiltonian is equivalent to minimizing a function—a pattern frequently encountered 
in learning formulations (see Section 11.2 and Chapter 14). 
Figure 4.8 A quantum circuit to demonstrate quantum parallelism. We evaluate the function
$f : \{0,1\} \rightarrow \{0,1\}$ with an appropriate unitary $U_f$. By putting the data register in the
superposition of the computational basis, we evaluate the function in its entire domain in one
step.


4.4 Quantum Parallelism 


While it is true that $n$ qubits represent at most $n$ bits, there is a distinct advantage
in using quantum circuits. Consider a function $f : \{0,1\} \rightarrow \{0,1\}$. If an appropriate
sequence of quantum gates is constructed, it is possible to transform an initial state
$|x, y\rangle$ to $|x, y \oplus f(x)\rangle$ (see Equation 4.21). The first qubit is called the data register, and
the second qubit is the target register. If $y = 0$, then we have $|x, f(x)\rangle$. We denote the
unitary transformation that achieves this mapping by $U_f$.

Suppose we combine $U_f$ with a Hadamard gate on the data register; that is, we
calculate the function on $(|0\rangle + |1\rangle)/\sqrt{2}$. If we perform $U_f$ with $y = 0$, the resulting
state will be


$$\frac{|0, f(0)\rangle + |1, f(1)\rangle}{\sqrt{2}}. \tag{4.23}$$


A single operation evaluated the function on both possible inputs (Figure 4.8). This
phenomenon is known as quantum parallelism. Typically, if we measure the state, we
have a 50% chance of obtaining $f(0)$ or $f(1)$: we cannot deduce both values from
a single measurement. The fundamental question of quantum algorithms is how to
exploit this parallelism without destroying the superposition.

This pattern generalizes to $n$ qubits: apply a Hadamard gate on each qubit of the data
register, then evaluate a function. The method is also known as the Hadamard transform: it
produces a superposition of $2^n$ states by using $n$ gates. Adding a target register, we get
the following state for an input state $|0\rangle^{\otimes n}|0\rangle$:


$$\frac{1}{\sqrt{2^n}} \sum_x |x\rangle|f(x)\rangle, \tag{4.24}$$


where the sum is over all possible values of $x$.
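The single-qubit case can be simulated directly. In the sketch below, the function `f` is a hypothetical example (logical NOT), and `oracle_unitary` is an illustrative helper that builds the permutation matrix sending $|x, y\rangle$ to $|x, y \oplus f(x)\rangle$.

```python
import numpy as np

def oracle_unitary(f):
    """Permutation matrix U_f with U_f |x, y> = |x, y XOR f(x)> for one-bit x and y."""
    U = np.zeros((4, 4))
    for x in (0, 1):
        for y in (0, 1):
            U[2 * x + (y ^ f(x)), 2 * x + y] = 1.0
    return U

def f(x):
    """Hypothetical example function: logical NOT."""
    return 1 - x

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

initial = np.array([1.0, 0.0, 0.0, 0.0])        # |0, 0>
superposed = np.kron(H, np.eye(2)) @ initial    # (|0, 0> + |1, 0>)/sqrt(2)
final = oracle_unitary(f) @ superposed          # (|0, f(0)> + |1, f(1)>)/sqrt(2)
outcome_probabilities = np.abs(final) ** 2      # order: |00>, |01>, |10>, |11>
```

One application of the oracle places both function values in the state, yet a measurement returns only one of the pairs $(0, f(0))$ and $(1, f(1))$, each with probability 1/2.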


4.5 Grover's Algorithm 


Grover’s algorithm finds an element in an unordered set quadratically faster than
the theoretical limit for classical algorithms. The sought element defines a function:
the function is evaluated as true on an element if it is the sought element. Grover’s
algorithm uses internal calls to an oracle that tells the value of this function, that
is, whether membership is true for a particular instance. The goal is then to use the
smallest possible number of applications of the oracle to find all elements that test true
(Figure 4.9).

Figure 4.9 Circuit diagram of Grover’s search. (a) The Grover operator. (b) The full circuit,
where the Grover operator $G$ is applied $O(\sqrt{N})$ times.

Assume the data collection has $N$ entries, with $n = \log N$ bits representing each
entry. We start by applying a Hadamard transform on $|0\rangle^{\otimes n}$ to obtain the equal
superposition state


$$|\psi\rangle = \frac{1}{\sqrt{N}} \sum_{x=0}^{N-1} |x\rangle. \tag{4.25}$$


We subsequently apply the Grover operator, also called the Grover diffusion
operator, on this state a total of $O(\sqrt{N})$ times. We denote this operator as $G$, and it
consists of four steps:


1. Apply a call to the oracle $O$.
2. Apply the Hadamard transform $H^{\otimes n}$.
3. Apply a conditional phase shift on the state with the exception of $|0\rangle$:

   $$|x\rangle \mapsto -|x\rangle. \tag{4.26}$$
4. Apply the Hadamard transform $H^{\otimes n}$ again.

The combined effect of steps 2-4 is

$$H^{\otimes n}(2|0\rangle\langle 0| - I)H^{\otimes n} = 2|\psi\rangle\langle\psi| - I, \tag{4.27}$$


where $|\psi\rangle$ is the equally weighted superposition of states in Equation 4.25. Thus the
short way of writing the Grover operator is


$$G = (2|\psi\rangle\langle\psi| - I)O. \tag{4.28}$$


The implementation of the components of the Grover operator is efficient. The two
Hadamard transforms require $n$ operations each. The conditional phase shift is a
controlled unitary operation and it requires $O(n)$ gates. The complexity of the oracle
calls depends on the application, but only one call is necessary per iteration.
The key steps are outlined once more in Algorithm 1. 
ALGORITHM 1 Grover’s algorithm

Require: Initial state $|0\rangle^{\otimes n}$.
Ensure: Element requested.
  Initialize superposition by applying the Hadamard transform $H^{\otimes n}|0\rangle^{\otimes n}$.
  for $O(\sqrt{N})$ times do
    Apply the Grover operator $G$
  end for
  Measure the system.

Adiabatic quantum computers are able to reproduce the algorithm in $O(\sqrt{N})$ time
(Roland and Cerf, 2002; see Equation 3.56 and Chapter 14).

Durr and Hoyer (1996) extended Grover’s search to find the minimum (or 
maximum) in an unsorted array. The algorithm calls Grover’s search to find the 
index of an item smaller than a particular threshold. This entry is chosen as the new 
threshold, and the procedure is repeated. The overall complexity is the same as for 
Grover’s algorithm. Since the array is unordered, if we are able to define the search 
space as discrete, this method finds the global optimum with high probability, which 
is tremendously useful in machine learning algorithms. 


4.6 Complexity Classes 


Although some quantum algorithms are faster than their best known classical coun- 
terparts, quantum computers are limited by the same fundamental constraints. Any 
problem that is solved by a quantum computer can be solved by a classical computer, 
given enough resources. A Turing machine simulates any quantum computer, and the 
Church-Turing thesis binds quantum algorithms. 

To gain insight into the actual power of quantum algorithms, let us take a look 
at complexity classes. Quantum computers run only probabilistic algorithms, and the 
class of problems that are solved efficiently by quantum computers is called bounded 
error, quantum, polynomial time (BQP). Bounded error means that the result of the 
computation will be correct with high probability. 

BQP is a subset of PSPACE, the set of problems that can be solved by a Turing
machine using a polynomial amount of space. BQP’s relationship to other well-known 
classes is unknown. For instance, it is not known whether BQP is within NP, the set 
of problems for which a solution can be verified in polynomial time. NP-complete 
problems, a subset of NP with problems at least as hard as the ones in NP, are suspected 
of being disjoint from BQP. 

It is conjectured that BQP is strictly larger than P, the problems for which a solution 
can be calculated in polynomial time using a deterministic Turing machine. 
4.7 Quantum Information Theory


Quantum information theory generalizes classical information theory to quantum 
systems. Quantum information theory investigates the elementary processes of how 
quantum information is stored, manipulated, and transmitted. Its boundaries are 
vague—quantum computing might be included under this umbrella term. Quantum 
information theory is a rich field of inquiry, but we will restrict ourselves to a few 
fundamental concepts which will be relevant later. 

A core concept of classical information theory is entropy, which has a generaliza- 
tion for the quantum case. Entropy is a measure to quantify the uncertainty involved in 
predicting the value of a random variable. Classical systems use the Shannon entropy, 
which is defined as 


$$H(X) = -\sum_x P(X = x) \log P(X = x). \tag{4.29}$$


Its value is maximum for the uniform distribution. In other words, elements of a
uniform distribution are the most unpredictable; the entropy for this case is
$H(X) = \log n$, where the distribution has $n$ values.
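As a quick numerical check, the sketch below evaluates the Shannon entropy for a uniform and a skewed distribution over four values. The base-2 logarithm is an assumed convention that gives the result in bits, and the example distributions are hypothetical.

```python
import math

def shannon_entropy(probabilities, base=2):
    """H(X) = -sum_x P(x) log P(x); outcomes with zero probability contribute nothing."""
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)

uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]

h_uniform = shannon_entropy(uniform)   # log2(4) = 2 bits
h_skewed = shannon_entropy(skewed)     # smaller: the outcome is easier to predict
```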

If we extend this measure to a quantum system, the von Neumann entropy of a
state/reduced state described by a density matrix $\rho$ is given by


$$S(\rho) = -\mathrm{tr}(\rho \log \rho). \tag{4.30}$$


The von Neumann entropy quantifies randomness, specifically the randomness in the best
possible measurement.

A function on a matrix is defined on its spectral decomposition. Taking the
decomposition of $\rho = \sum_i \lambda_i |i\rangle\langle i|$, we get


$$S(\rho) = -\sum_k \langle k| \left( \sum_i \lambda_i \log \lambda_i |i\rangle\langle i| \right) |k\rangle = -\sum_k \lambda_k \log \lambda_k. \tag{4.31}$$


The density matrix of a pure state is a rank-1 projection: it is an idempotent operator.
Hence, its von Neumann entropy is zero. Only mixed states have a von Neumann
entropy larger than zero. The von Neumann entropy is invariant to basis change.

Bipartite pure states are entangled if and only if their reduced density matrix is
mixed. Hence, the von Neumann entropy of the reduced state is
a good quantifier of entanglement. Maximally entangled states were mentioned in
Section 3.3. The formal definition is that a state is maximally entangled if the von
Neumann entropy of its reduced state is $\log n$, where $n$ is the dimension of the Hilbert
space of the reduced system. Contrast this with the maximum entropy in the classical case.
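These statements can be checked on two-qubit examples. The sketch below computes the von Neumann entropy from the eigenvalues of a density matrix and applies it to a Bell state and to a product state. The helper names are hypothetical, and the base-2 logarithm is an assumed convention.

```python
import numpy as np

def von_neumann_entropy(rho, base=2):
    """S(rho) = -sum_k lambda_k log lambda_k over the nonzero eigenvalues of rho."""
    eigenvalues = np.linalg.eigvalsh(rho)
    eigenvalues = eigenvalues[eigenvalues > 1e-12]   # treat 0 log 0 as 0
    return float(-np.sum(eigenvalues * np.log(eigenvalues)) / np.log(base))

def reduced_state_first_qubit(psi):
    """Trace out the second qubit of a two-qubit pure state."""
    amplitudes = psi.reshape(2, 2)                   # amplitudes[a, b] for |a, b>
    return amplitudes @ amplitudes.conj().T

bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)   # (|00> + |11>)/sqrt(2)
product_state = np.array([1.0, 0.0, 0.0, 0.0])       # |00>

s_bell_total = von_neumann_entropy(np.outer(bell, bell.conj()))
s_bell_reduced = von_neumann_entropy(reduced_state_first_qubit(bell))
s_product_reduced = von_neumann_entropy(reduced_state_first_qubit(product_state))
```

The Bell state as a whole is pure and has zero entropy, while its reduced state carries one bit, the maximum for a single qubit; the reduced state of the product state remains pure.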

Not all concepts in quantum information theory have classical counterparts.
Quantum distinguishability is an example. In principle, telling two classical sequences
of bits apart is not challenging. This is generally not true for quantum states, unless
they are orthogonal. For instance, consider two states, $|0\rangle$ and $(|0\rangle + |1\rangle)/\sqrt{2}$. A single
measurement will not tell which state we are dealing with. Measuring the second one
will give 0 with probability 1/2. Repeated measurements are necessary to identify the
state.

Since the two states, $\rho$ and $\sigma$, are, in general, not orthogonal, we cannot perform
projective measurements, and we must restrict ourselves to a positive operator-valued
measure (POVM). With this in mind, fidelity measures the distinguishability of two
probability distributions: the probability distributions after measurement on the two
states. The selection of POVMs matters; hence, we seek the best measurement that
distinguishes the two states the most. For pure states, fidelity thus measures the overlap
between two states, $|\psi\rangle$ and $|\phi\rangle$:


$$F(\psi, \phi) = |\langle\phi|\psi\rangle|. \tag{4.32}$$


This resembles the cosine similarity used extensively in machine learning (Section 2.2).

Fidelity on generic mixed or pure states is defined as


$$F(\rho, \sigma) = \mathrm{tr}\sqrt{\sqrt{\rho}\,\sigma\sqrt{\rho}}. \tag{4.33}$$


The basic properties of fidelity are as follows:


1. Although it is not obvious from the definition, fidelity is symmetric in the arguments, but it
does not define a metric.
2. Its range is in $[0, 1]$, and it equals 1 if and only if $\rho = \sigma$.
3. $F(\rho_1 \otimes \rho_2, \sigma_1 \otimes \sigma_2) = F(\rho_1, \sigma_1)F(\rho_2, \sigma_2)$.
4. It is invariant to unitary transformations: $F(U\rho U^\dagger, U\sigma U^\dagger) = F(\rho, \sigma)$.
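The general definition can be evaluated with a matrix square root. The sketch below assumes the form $F(\rho, \sigma) = \mathrm{tr}\sqrt{\sqrt{\rho}\,\sigma\sqrt{\rho}}$ and computes square roots of positive semidefinite matrices through their spectral decomposition; the helper names are hypothetical. It compares $|0\rangle$ with $(|0\rangle + |1\rangle)/\sqrt{2}$, for which the pure-state overlap is $1/\sqrt{2}$.

```python
import numpy as np

def psd_sqrt(matrix):
    """Square root of a positive semidefinite Hermitian matrix via eigendecomposition."""
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)
    eigenvalues = np.clip(eigenvalues, 0.0, None)    # remove tiny negative round-off
    return (eigenvectors * np.sqrt(eigenvalues)) @ eigenvectors.conj().T

def fidelity(rho, sigma):
    """F(rho, sigma) = tr sqrt( sqrt(rho) sigma sqrt(rho) )."""
    root = psd_sqrt(rho)
    return float(np.real(np.trace(psd_sqrt(root @ sigma @ root))))

ket0 = np.array([1.0, 0.0])
plus = np.array([1.0, 1.0]) / np.sqrt(2)
rho = np.outer(ket0, ket0)       # density matrix of |0>
sigma = np.outer(plus, plus)     # density matrix of (|0> + |1>)/sqrt(2)

f_pure = fidelity(rho, sigma)        # compare with the pure-state overlap
f_same = fidelity(rho, rho)          # identical states
f_swapped = fidelity(sigma, rho)     # arguments exchanged
```

The values illustrate the first two properties in the list: exchanging the arguments leaves the result unchanged, and the fidelity of a state with itself is 1.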


Quantum state tomography is a concept related to quantum distinguishability where
we are given an unknown state $\rho$, and we must identify what state it is. Quantum
process tomography generalizes this to identifying an unknown unitary evolution; this
procedure is analogous to learning an unknown function (Chapter 14).
Part Two


Classical Learning Algorithms 
Unsupervised Learning


Unsupervised learning finds structures in the data. Labels for the data instances or 
other forms of guidance for training are not necessary. This makes unsupervised 
learning attractive in applications where data is cheap to obtain, but labels are 
either expensive or not available. Finding emergent topics in an evolving document 
collection is a good example: we do not know in advance what those topics might be. 
Detecting faults, anomalous instances in a long time series, is another example: we 
want to find out if something went wrong and if it did, then we would like to know 
when. 

A learning algorithm, having no guidance, must identify structures on its own, 
relying solely on the data instances. Perhaps the most obvious approach is to start 
by studying the eigenvectors of the data, leading to geometric insights into the most 
prevalent directions in the feature space. Principal component analysis builds on this 
idea, and multidimensional scaling reduces the number of dimensions to two or three 
using the eigenstructure (Section 5.1). Dimensionality reduction is useful, as we are 
able to extract a visual overview of the data, but multidimensional scaling will fail 
to find nonlinear structures. Manifold learning extends the method to more generic 
geometric shapes in the high-dimensional feature space (Section 5.2). 

If we study the distances between data instances, we are often able to find groups 
of similar ones. These groups are called clusters, and for data sets displaying simple 
geometric structures, K-means and hierarchical clustering methods detect them easily 
(Sections 5.3 and 5.4). If the geometry of the data instances is more complex, density- 
based clustering can help, which in turn is similar to manifold learning methods 
(Section 5.5). 

Unsupervised learning is a vast field, and this chapter barely offers a glimpse. The 
algorithms in this chapter are the most relevant to quantum methods that have already 
been published. 


5.1 Principal Component Analysis 


Let us assume that the data matrix X, consisting of the data instances $\{x_1, \ldots, x_N\}$, has 
a zero mean. Principal component analysis looks at the eigenstructure of $X^\top X$. This 
$d \times d$ square matrix, where d is the dimensionality of the feature space, is known (up to 
a normalizing factor) as the empirical sample covariance matrix in the statistical literature. 

If we calculate the eigendecomposition of $X^\top X$, we can arrange the normalized 
eigenvectors in a new matrix W. If we denote the diagonal matrix of eigenvalues by 
$\Lambda$, we have 


$$X^\top X = W \Lambda W^\top. \tag{5.1}$$


We arrange the eigenvalues in $\Lambda$ in decreasing order, and match the eigenvectors in W. 
The principal component decomposition of the data matrix X is a projection to the 
basis given by the eigenvectors: 


T = XW. (5.2) 


The coordinates in T are arranged so that the greatest variance of the data lies on the 
first coordinates, with the rest of the variances following in decreasing order on the 
subsequent coordinates (Jolliffe, 1989). 

Using W, we can also perform projection to a lower-dimensional space, discarding 
some principal components. Let $W_K$ denote the matrix with only K eigenvectors, 
corresponding to the K largest eigenvalues. The projection becomes 


$$T_K = X W_K. \tag{5.3}$$


The matrix $T_K$ has a feature space different from that of T, having only K columns. 
Among all rank K matrices, $T_K$ is the best approximation to T for any unitarily 
invariant norm (Mirsky, 1960). Hence, this projection also minimizes the total squared 
error $\|T - T_K\|^2$. When projection to two or three dimensions is performed, this 
method is also known as multidimensional scaling (Cox and Cox, 1994). 
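
The decomposition and the truncated projection can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data rather than a reference implementation: the random data, the seed, and the choice K = 2 are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with very different variances along the three axes.
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.2])
X = X - X.mean(axis=0)                  # the text assumes zero mean

# Eigendecomposition of X^T X; eigh returns ascending eigenvalues,
# so we reorder them to be decreasing and match the eigenvectors.
eigvals, W = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
eigvals, W = eigvals[order], W[:, order]

T = X @ W                               # full decomposition, T = XW
K = 2
T_K = X @ W[:, :K]                      # projection on the K leading directions

variances = T.var(axis=0)               # variance along each new coordinate
discarded = np.sum(T[:, K:] ** 2)       # squared error of dropping the rest
```

The entries of `variances` come out in decreasing order, and `discarded` equals the sum of the eigenvalues that were left out, which is the total squared error mentioned above.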

Oddly enough, quantum states are able to reveal their own eigenstructure, which is 
the foundation of quantum principal component analysis (Section 10.3). 


5.2 Manifold Embedding 


As principal component analysis and multidimensional scaling are based on the 
eigendecomposition of the data matrix X, they can only deal with flat Euclidean 
structures, and they fail to discover the curved or nonlinear structures of the input 
data (Lin and Zha, 2008). 

Manifold learning covers a broad range of algorithms that assume that data in a 
high-dimensional space align to a manifold in a space of much lower dimensions. The 
method may also assume which manifold it is; this is typical of two-dimensional or 
three-dimensional embeddings that project the data onto a sphere or a torus (Ito 
et al., 2000; Onclinx et al., 2009). Methods that embed data points in a space of more 
dimensions, though still fewer than in the original space, do not typically assume a 
specific manifold. Instead, they make assumptions about certain properties of the 
manifold, for instance, that it must be Riemannian (Lin and Zha, 2008). 

As a prime example of manifold learning, Isomap extends multidimensional 
scaling to find a globally optimal solution for an underlying nonlinear manifold 
(Tenenbaum et al., 2000). A shortcoming of Isomap is that it fails to find nonconvex 
embeddings (Weinberger et al., 2004). The procedure is outlined in Algorithm 2. 


ALGORITHM 2 Isomap 

Require: Initial data points $\{x_1, \ldots, x_N\}$. 

Ensure: Low-dimensional embedding of data points. 
1: Find neighbors of each data point. 
2: Construct neighborhood graph. 
3: Estimate geodesic distances between all data points on the 
   manifold by computing their shortest path distances in the graph. 
4: Minimize distance between geodesic distance and embedding 
   Euclidean distance matrices using eigendecomposition. 


A major drawback of Isomap is its computational complexity. Finding the neighbors 
has $O(N^2)$ complexity, calculating the geodesic distances with Dijkstra's algorithm 
has $O(N^2 \log N)$ steps, and the final eigendecomposition is cubic. 
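
The steps of Isomap can be sketched compactly with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: for brevity, the shortest paths are computed with the Floyd-Warshall recursion (cubic in N) in place of Dijkstra's algorithm, the neighborhood graph is assumed to be connected, and the function name `isomap` and its parameters are illustrative.

```python
import numpy as np

def isomap(X, n_neighbors=6, n_components=2):
    N = X.shape[0]
    # Step 1: pairwise Euclidean distances, O(N^2).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Step 2: neighborhood graph with edges to the nearest neighbors.
    G = np.full((N, N), np.inf)
    idx = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]   # column 0 is the point itself
    rows = np.repeat(np.arange(N), n_neighbors)
    G[rows, idx.ravel()] = D[rows, idx.ravel()]
    G = np.minimum(G, G.T)                               # make the graph undirected
    np.fill_diagonal(G, 0.0)
    # Step 3: geodesic distances as shortest paths (Floyd-Warshall here).
    for k in range(N):
        G = np.minimum(G, G[:, [k]] + G[[k], :])
    # Step 4: classical multidimensional scaling on the geodesic distances.
    J = np.eye(N) - np.ones((N, N)) / N                  # centering matrix
    B = -0.5 * J @ (G ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:n_components]
    Y = vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))
    return Y, G

# Points on a half circle: a one-dimensional manifold embedded in the plane.
t = np.linspace(0.0, np.pi, 40)
X = np.column_stack([np.cos(t), np.sin(t)])
Y, G = isomap(X, n_neighbors=2, n_components=1)
```

On this example the geodesic distance between the two end points, `G[0, -1]`, is close to the arc length $\pi$ rather than the straight-line distance 2, and the one-dimensional embedding `Y` preserves the order of the points along the arc.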

If the size of the matrix that is subject to decomposition changes, updates to the 
embedding are necessary: iterative update is also possible by extensions (Law and 
Jain, 2006; Law et al., 2004). A new data point may introduce new shortest paths, 
and this is addressed by a modified version of Dijkstra’s algorithm. Overall time 
complexity is the same as that of the original. 

An extension called L-Isomap chooses landmark points to improve the run 
time (De Silva and Tenenbaum, 2003). Landmark selection avoids expensive, 
quadratic programming computations (Silva et al., 2006). Other extensions of Isomap 
are too numerous to mention, but this method highlights the crucial ideas in manifold 
learning methods that are useful for developing their quantum version (Section 10.4). 


5.3 K-Means and K-Medians Clustering 


The K-means algorithm, which should not be confused with the K-nearest neighbors 
classifier, is a method to cluster data instances on the basis of their pairwise distances 
into K partitions. It tries to minimize overall intracluster variance. 

The algorithm starts by partitioning the input points into K initial sets, either at 
random or using some heuristic data. It then calculates the centroid of each set as 
follows: 


$$c = \frac{1}{N_c} \sum_{j=1}^{N_c} x_j, \tag{5.4}$$


where $N_c$ is the number of vectors in the subset. It constructs a new partition 
by associating each point with the closest centroid. The centroid-object distances 
are computed by the cosine dissimilarity or by other distance functions. Then the 
centroids are recalculated for the new clusters, and the process is repeated by alternate 
application of these two steps until there is convergence, which is obtained when the 
points no longer switch clusters. The convergence is toward a local minimum (Bradley 
and Fayyad, 1998). 
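
The two alternating steps can be sketched as follows. This is a minimal illustration, with a few assumptions that are not in the text: the initial centroids are K randomly chosen data points (one common heuristic), the distance is Euclidean rather than the cosine dissimilarity, and the function name `kmeans` and the synthetic data are illustrative.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Heuristic initialization: K distinct data points serve as centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: associate each point with the closest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                        # no point switched clusters
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members.
        for c in range(K):
            members = X[labels == c]
            if len(members) > 0:         # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return labels, centroids

# Two well-separated groups of 50 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(0.0, 0.5, size=(50, 2)) + 10.0])
labels, centroids = kmeans(X, K=2)
```

With groups this far apart, the procedure recovers the two groups; on less clearly separated data, the result depends on the initialization, as the local convergence noted above suggests.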

The algorithm does not guarantee a global optimum for clustering: the quality of 
the final solution depends largely on the initial set of clusters. In text classification, 
however, where extremely sparse feature spaces are common, K-means has been 
proved to be highly efficient (Steinbach et al., 2000). 

Unlike linear models, K-means does not divide the feature space linearly; hence, 
it tends to perform better on linearly inseparable problems (Steinbach et al., 2000). 
The most significant disadvantage is its inefficiency with regard to classification time. 
Linear classifiers consider a simple dot product, while in its simplest formulation, 
K-means requires the entire set of training instances ranked for similarity with the 
centroids (Lan et al., 2009). 

Similarly to Equation 5.1 in principal component analysis, we may calculate 
the eigendecomposition of the covariance matrix of the data instances from $XX^\top$. 
The principal directions in this decomposition are identical to the cluster centroid 
subspace (Ding and He, 2004). This way, principal component analysis and K-means 
clustering are closely related, and by the appropriate projection, the search space 
reduces to where the global solution for clustering lies, helping to find near optimal 
solutions. 

The quantum variant of K-means enables the fast calculation of distances, and an 
exponential speedup over the classical variant is feasible if we allow the input and 
output vectors to be quantum states (Section 10.5). 

K-medians is a variation of K-means. In K-means, the centroid seldom coincides 
with a data instance: it lies between the data points. K-medians, on the other hand, 
calculates the median instead of the mean; hence, the representative element of a 
cluster is always a data instance. Unlike K-means, K-medians may not converge: 
it may oscillate between medians. K-medians is more robust and less sensitive to 
noise than K-means; moreover, it does not actually need the data points, only the 
distances between them, and hence the algorithm works with a Gram 
matrix alone (Section 7.4). Its computational complexity is higher than 
for K-means (Park and Jun, 2009), but it has an efficient quantum version 
(Section 10.6). 
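
To illustrate that only pairwise distances are needed, the sketch below clusters from a distance matrix alone and always picks a data instance as the representative of a cluster. It is a medoid-style update written for this example, under the assumption that the representative is the member with the smallest total distance to the other members; the function name and the iteration cap (which also guards against oscillation) are illustrative.

```python
import numpy as np

def k_medians_from_distances(D, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    medians = rng.choice(N, size=K, replace=False)   # indices of data instances
    for _ in range(n_iter):                          # cap guards against oscillation
        labels = D[:, medians].argmin(axis=1)
        new_medians = medians.copy()
        for c in range(K):
            members = np.flatnonzero(labels == c)
            if len(members) > 0:
                # Representative: member with the smallest total distance
                # to the other members of the same cluster.
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medians[c] = members[within.argmin()]
        if np.array_equal(new_medians, medians):
            break
        medians = new_medians
    labels = D[:, medians].argmin(axis=1)
    return labels, medians

# Only the distance matrix D is handed to the algorithm.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(0.0, 0.5, size=(50, 2)) + 10.0])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
labels, medians = k_medians_from_distances(D, K=2)
```

The returned `medians` are indices into the data set, so each representative is an actual data instance.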


5.4 Hierarchical Clustering 


Hierarchical clustering is as simple as K-means, but instead of there being a fixed 
number of clusters, the number changes in every iteration. If the number increases, 
we talk about divisive clustering: all data instances start in one cluster, and splits 
are performed in each iteration, resulting in a hierarchy of clusters. Agglomerative 
clustering, on the other hand, is a bottom-up approach: each instance is a cluster at the 
beginning, and clusters are merged in every iteration. With use of either method, the 
hierarchy will have $N - 1$ levels (Hastie et al., 2008). 

This way, hierarchical clustering does not provide a single clustering of the data, 
but $N - 1$ of them. It is up to the user to decide which one fits 
the purpose. Statistical heuristics are sometimes employed to aid the decision. 

The arrangement after training, the hierarchy of clusters, is often plotted as a 
dendrogram. Nodes in the dendrogram represent clusters. The length of an edge 
between a cluster and its split is proportional to the dissimilarity between the split 
clusters. The popularity of hierarchical clustering is related to the dendrograms: these 
figures provide an easy-to-interpret view of the clustering structure. 

Agglomerative clustering is more extensively researched than divisive clustering. 
Yet, the quantum variant (Section 10.7) is more apt for the divisive type. The classical 
divisive clustering algorithm begins by placing all data instances in a single cluster 
$C_0$. Then, it chooses the data instance whose average dissimilarity from all the other 
instances is the largest. This is the computationally most expensive step, having $\Omega(N^2)$ 
complexity in general. The selected data instance forms the first member of a second 
cluster $C_1$. Elements are reassigned from $C_0$ to $C_1$ as long as their average distance 
to $C_0$ is greater than that to $C_1$. This forms one iteration, after which we have two 
clusters: what remained from the original $C_0$, and the newly formed $C_1$. The procedure 
continues in subsequent iterations. The iterative splitting of clusters continues until all 
clusters contain only one data instance, or when it is no longer possible to transfer 
instances between clusters using the dissimilarity measure. Outliers are quickly 
isolated with this method, and unbalanced clusters do not pose a problem either. 
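
One iteration of the divisive procedure described above can be sketched as follows. This is a minimal illustration of a single split, assuming a precomputed dissimilarity matrix D and a cluster of at least two instances; the function name `divisive_split` and the tiny one-dimensional data set are illustrative.

```python
import numpy as np

def divisive_split(D, members):
    """Split one cluster (a list of indices into D) into C0 and C1."""
    c0 = list(members)
    sub = D[np.ix_(c0, c0)]
    # Instance with the largest average dissimilarity from all the others.
    avg = sub.sum(axis=1) / (len(c0) - 1)
    first = c0[int(avg.argmax())]
    c1 = [first]
    c0.remove(first)
    moved = True
    while moved and len(c0) > 1:
        moved = False
        for i in list(c0):
            if len(c0) == 1:
                break
            others = [j for j in c0 if j != i]
            d0 = D[i, others].mean()     # average distance to the rest of C0
            d1 = D[i, c1].mean()         # average distance to C1
            if d0 > d1:                  # closer to C1: reassign
                c0.remove(i)
                c1.append(i)
                moved = True
    return c0, c1

# Five points on a line: three near 0 and two near 5.
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1])
D = np.abs(x[:, None] - x[None, :])
c0, c1 = divisive_split(D, range(len(x)))
```

Here the instance at 5.1 has the largest average dissimilarity and seeds the new cluster, the instance at 5.0 follows it, and the split ends with `c0` holding the three points near 0. Applying the same function recursively to `c0` and `c1` would produce the full hierarchy.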


5.5 Density-Based Clustering 


Density-based clustering departs from the global objective functions in K-means 
and hierarchical clustering. Instead, this approach deals with local neighborhoods, 
resembling manifold learning methods. The data instances are assumed to alternate in 
high-density and low-density areas, the latter type separating the clusters. The shape 
of the clusters can thus be arbitrary; it does not even have to be convex (Kriegel et al., 
2011). 

Density-based spatial clustering of applications with noise (DBSCAN) is the most famous 
example (Ester et al., 1996). It takes two input parameters, $\epsilon$ and $N_{\min}$. The parameter 
$\epsilon$ defines a neighborhood $\{x_j \in X \mid d(x_i, x_j) \leq \epsilon\}$ of the data instance $x_i$. The minimum 
points parameter $N_{\min}$ defines a core object, a point with a neighborhood consisting 
of more elements than this parameter. 

A point $x_i$ is density-reachable from a core object $x_j$ if a finite sequence of core 
objects between $x_j$ and $x_i$ exists such that each belongs to an $\epsilon$-neighborhood of its 
predecessor. Two points are density-connected if both are density-reachable from a 
common core object. 

Every point that is reachable from core objects can be factorized into maximally 
connected components serving as clusters. The points that are not connected to any 
core point can be considered as outliers, because they are not covered by any cluster. 
The run time complexity of this algorithm is $O(N \log N)$ when a spatial index 
supports the neighborhood queries. 
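
The definitions of neighborhoods, core objects, and density-reachability translate into a short procedure. The sketch below is a brute-force illustration that computes all pairwise distances, so it runs in quadratic time and does not reach the $O(N \log N)$ bound, which would need a spatial index. It assumes that a neighborhood contains the point itself and that a closed ball of radius `eps` is used; the function name `dbscan` and the toy data are illustrative.

```python
import numpy as np

def dbscan(X, eps, n_min):
    N = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # eps-neighborhood of every point (the point itself is included).
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(N)]
    # Core objects: neighborhoods with more elements than n_min.
    core = np.array([len(nb) > n_min for nb in neighbors])
    labels = np.full(N, -1)              # -1 marks outliers
    cluster = 0
    for i in range(N):
        if labels[i] != -1 or not core[i]:
            continue
        # Grow a cluster from an unvisited core object.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if not core[p]:
                continue                 # border points join but do not expand
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels

# Two dense 5 x 5 grids (spacing 0.2) and one distant point.
grid = np.array([[i, j] for i in range(5) for j in range(5)]) * 0.2
X = np.vstack([grid, grid + 5.0, [[20.0, 20.0]]])
labels = dbscan(X, eps=0.3, n_min=3)
```

Each grid forms its own cluster, and the isolated point keeps the label -1, which marks it as an outlier.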

With regard to the two parameters $\epsilon$ and $N_{\min}$, there is no straightforward way to 
fit them to data. To overcome this obstacle, a related algorithm orders the data instances 
to identify the clustering structure (OPTICS). This order augments the clusters with an 
additional data structure while remaining consistent with density-based clustering 
(Ankerst et al., 1999). Instead of just one point in the parameter space, this algorithm 
covers a spectrum of all different $\epsilon' \leq \epsilon$. The constructed ordering is used either 
automatically or interactively to find the optimal clustering. 


6 Pattern Recognition and Neural Networks 


A neural network is a network of units, some of which are designated as input and 
output units. These units are also called neurons. The units are connected by weighted 
edges—synapses. A unit receives a weighted input based on its connections, and 
generates a response of a univariate or multivariate form—this is the activation of 
a unit. In the simplest configuration, there is only one neuron, which is called a 
perceptron (Section 6.1). 

If the network consists of more than one neuron, a signal spreads across the network 
starting from the input units, and either spreads one way toward the output units, or 
circulates in the network until it achieves a stable state. Learning consists of adjusting 
the weights of the connections. 

Neural networks are inspired by the central nervous systems of animals, but they 
are not necessarily valid metaphors. Computational considerations make the update 
cycle different from that of natural systems. Furthermore, rigorous analysis from the 
theory of statistical learning also introduced changes that took neural networks further 
away from their biological counterparts. 

Memorizing and recognizing a pattern are emergent processes of interconnected 
networks of simple units. This approach is called connectionist learning. The topology 
of the network is subject to infinite variations, which gave rise to hundreds of neural 
models (Sections 6.2 and 6.3). 

While the diversity of neural networks is bewildering, a common characteristic is 
that the activation function is almost exclusively nonlinear in all neural network ar- 
rangements. The artificial neurons are essentially simple, distributed processing units. 
A clear delineation of subtasks is absent; each unit is performing a quintessentially 
similar computation in parallel. Recent advances in hardware favor this massively 
parallel nature of neural networks, allowing the development of extremely large 
connectionist models (Sections 6.4 and 6.5). 


6.1 The Perceptron 


The simplest type of neural network classifiers is the perceptron, consisting of a single 
artificial neuron (Rosenblatt, 1958). It is a linear discriminant: it cannot distinguish 
between linearly inseparable cases (Minsky and Papert, 1969). 

The single neuron in the perceptron works as a binary classifier (Figure 6.1). It 
maps an input $x \in \mathbb{R}^d$ to a binary output value, which is produced by a Heaviside step 
function: 

$$f(x) = \begin{cases} 1 & \text{if } w^\top x + b > 0, \\ 0 & \text{otherwise}, \end{cases} \tag{6.1}$$

where w is a weight vector (each weight in this vector corresponds to a feature in the 
feature space) and b is the bias term. The purpose of the bias term is to change the 
position of the decision plane. This function is called the activation function. 

Training of the perceptron means obtaining the weight vector and the bias term 
to give the correct class for the training instances. Unfortunately, training is not 
guaranteed to converge: learning does not terminate if the training set is not linearly 
separable. The infamous XOR problem is an example: the perceptron will never learn 
to classify this case correctly (Figure 6.2). It is not surprising to find a set of four 
points on the plane that the perceptron cannot learn, as the planar perceptron has a 
Vapnik-Chervonenkis dimension of three. 

Figure 6.1 A perceptron input consists of the d-dimensional data instances. With a response 
function f, it produces an output f(x). 

Figure 6.2 The XOR problem: points in a plane in this configuration can never be separated by 
a linear classifier. 

If the problem is linearly separable, however, the training of the perceptron will 
converge. The procedure is known as the delta rule, and it is a simple gradient descent. 
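We minimize the error term 


$$E = \frac{1}{2} \sum_{i=1}^{N} (y_i - f(x_i))^2. \tag{6.2}$$

We initialize the weights and the threshold. Weights may be initialized to zero or to a 
random value. We take the simple partial derivative in $w_j$ in the support of f; for the 
term that belongs to a single training instance $x_i$, it is 


$$\frac{\partial E}{\partial w_j} = -(y_i - f(x_i))\, x_{ij}. \tag{6.3}$$

The change to w should be proportional to this, yielding the update formula for the 
weight vector: 


$$\Delta w_j = \gamma\, (y_i - f(x_i))\, x_{ij}, \tag{6.4}$$


where $\gamma$ is a predefined learning rate. 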
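
The delta rule can be sketched in a few lines. This is a minimal illustration with assumptions that go beyond the text: the weights and the bias start at zero, the bias is updated like a weight on a constant input of one, and training stops after a full pass without errors or after `max_epochs` passes. The function name `train_perceptron` is illustrative.

```python
import numpy as np

def train_perceptron(X, y, gamma=1.0, max_epochs=100):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            f_i = 1 if w @ x_i + b > 0 else 0      # Heaviside activation
            delta = y_i - f_i
            if delta != 0:
                w += gamma * delta * x_i           # delta rule for the weights
                b += gamma * delta                 # same rule for the bias
                errors += 1
        if errors == 0:
            return w, b, True                      # every instance is classified correctly
    return w, b, False                             # gave up: no separating plane found

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])
y_xor = np.array([0, 1, 1, 0])

w_and, b_and, and_converged = train_perceptron(X, y_and)
w_xor, b_xor, xor_converged = train_perceptron(X, y_xor)
```

The linearly separable AND function is learned within a handful of passes, whereas for XOR the loop runs until `max_epochs` is exhausted, whatever the value of the cap, in line with the discussion above.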

The capacity of a linear threshold perceptron for large d is two bits per 
weight (Abu-Mostafa and St. Jacques, 1985; MacKay, 2005). This capacity is vastly 
expanded by the quantum perceptron (Section 11.2). 


6.2 Hopfield Networks 


A Hopfield network is a simple assembly of perceptrons that is able to overcome the 
XOR problem (Hopfield, 1982). The array of neurons is fully connected, although 
neurons do not have self-loops (Figure 6.3). This leads to $K(K - 1)$ interconnections 
if there are K nodes, with a $w_{ij}$ weight on each. In this arrangement, the neurons 
transmit signals back and forth to each other in a closed-feedback loop, eventually 
settling in stable states. 
Figure 6.3 A Hopfield network with the number of nodes K matching the number of input 
features d. 

An important assumption is that the weights are symmetric, $w_{ij} = w_{ji}$, for neural 
interactions. This is unrealistic for real neural systems, in which two neurons are 
unlikely to act on each other symmetrically. 

The state $s_i$ of a unit is either $+1$ or $-1$. It is activated by the following rule: 


$$s_i = \begin{cases} +1 & \text{if } \sum_j w_{ij} s_j \geq \theta_i, \\ -1 & \text{otherwise}, \end{cases} \tag{6.5}$$


where $\theta_i$ is a threshold value corresponding to the node. This activation function 
mirrors that of the perceptron. 

The activation of nodes happens either asynchronously or synchronously. The 
former case is closer to real biological systems: a node is picked to start the 
update, and consecutive nodes are activated in a predefined order. In synchronous 
mode, all units are updated at the same time, which is much easier to deal with 
computationally. 

In a model called Hebbian learning, simultaneous activation of neurons leads to 
increments in synaptic strength between those neurons. The higher the value of a $w_{ij}$ 
weight, the more likely that the two connected neurons will activate simultaneously. 
In Hopfield networks, Hebbian learning manifests itself in the following form: 


$$w_{ij} = \frac{1}{N} \sum_{k=1}^{N} x_{ki} x_{kj}. \tag{6.6}$$


Here $x_k$ is in binary representation; that is, the value $x_{ki}$ is a bit for each i. 

Hopfield networks have a scalar value associated with each neuron of the network 
that resembles the notion of energy. The sum of these individual scalars gives the 
“energy” of the network: 


$$E = -\frac{1}{2} \sum_{i,j} w_{ij} s_i s_j + \sum_i \theta_i s_i. \tag{6.7}$$


If we update the network weights to learn a pattern, this value will either remain the 
same or decrease, hence justifying the name “energy.” The quadratic interaction term 
also resembles the Hamiltonian of a spin glass or an Ising model, which some models 
of quantum computing can easily exploit (Section 14.3). 

A Hopfield network is an associative memory, which is different from a pattern 
classifier, the task of a perceptron. Taking hand-written digit recognition as an 
example, we may have hundreds of examples of the number three written in various 
ways. Instead of classifying it as number three, an associative memory would recall a 
canonical pattern for the number three that we previously stored there. We may even 
consider an associative memory as a form of noise reduction. 
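
The recall of a stored pattern from a corrupted input can be sketched as follows. This is a minimal illustration under a few assumptions: the patterns are encoded with the values $+1$ and $-1$ so that they match the unit states, all thresholds are zero, the weights follow the Hebbian form with the diagonal set to zero (no self-loops), and the units are updated asynchronously in a fixed order. The function names are illustrative.

```python
import numpy as np

def hebbian_weights(patterns):
    """w_ij = (1/N) sum_k x_ki x_kj, with no self-loops."""
    N = patterns.shape[0]
    W = patterns.T @ patterns / N
    np.fill_diagonal(W, 0.0)
    return W

def energy(W, s, theta):
    return -0.5 * s @ W @ s + theta @ s

def recall(W, s, theta, sweeps=10):
    s = s.copy()
    for _ in range(sweeps):
        changed = False
        for i in range(len(s)):          # asynchronous update, fixed order
            new = 1 if W[i] @ s >= theta[i] else -1
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:                  # a stable state has been reached
            break
    return s

# Two orthogonal patterns stored in a network of K = 8 units.
p1 = np.array([1, 1, 1, 1, -1, -1, -1, -1])
p2 = np.array([1, -1, 1, -1, 1, -1, 1, -1])
W = hebbian_weights(np.array([p1, p2]))
theta = np.zeros(8)

noisy = p1.copy()
noisy[0] = -noisy[0]                     # corrupt one bit of the first pattern
recalled = recall(W, noisy, theta)
```

The corrupted input settles back into the stored pattern `p1`, and the energy of the recalled state is lower than that of the noisy input, which illustrates the noise-reduction view of an associative memory.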

The storage capacity of this associative memory—that is, the number of patterns 
that are stored in the network—is linear in the number of neurons. Estimates depend 
on the strategy used for updating the weights. With Hebbian learning, the estimate is 
about $N \leq 0.15K$. The quantum variant of Hopfield networks provides an exponential 
increase over this (Section 11.1). 


6.3 Feedforward Networks 


A Hopfield network is recurrent: the units form closed circles. If the number of output 
nodes required is lower, the storage capacity is massively improved by a nonrecurrent 
topology, the feedforward network. In a feedforward neural network, connections 
between the units do not form a directed cycle: information moves in only one 
direction, from the input nodes through an intermediate layer known as the hidden 
layer to the output nodes (Figure 6.4). Use of multiple hidden layers is sometimes 
advised. Rules for determining the number of hidden layers or the number of nodes 
are either absent or ad hoc for a given application. The number of hidden units in 
the neural network affects the generalization performance, as the layers increase the 
model complexity (Rumelhart et al., 1994). 

The input units represent the features in the feature space; hence, there are d 
nodes in this layer. The output units often represent the category or categories in a 
classification scenario; the number of nodes in this layer corresponds to this. If the 
output nodes are continuous and do not represent categories, we may view the network 
as a universal approximator for functions (Hornik et al., 1989). Following the model 
of Hebbian learning, the weights on the edges connecting the units between the layers 
represent dependency relations. 

Feedforward networks use a variety of learning techniques. Often these are 
generalizations of the delta rule for training a perceptron—the most popular technique 
is called back-propagation. 

The back-propagation algorithm starts with random weights $w_{ij}$ on the synapses. 
We train the network in a supervised fashion, adjusting the weights with the arrival 
of each new training instance at the input layer. The changes in the weights are 
incremental and depend on the error produced in the output layer, where the 
output values are compared with the correct answer. Back-propagation refers to the 
feedback to the network through the adjustment of weights depending on the error 
function (Rumelhart et al., 1986). The scheme relies on a gradient descent; hence, the 
error function and the activation function should be differentiable. 

Figure 6.4 A feedforward neural network. The number of nodes in the input layer matches 
the dimensions of the feature space (d). The hidden layer has K nodes, and the output layer has 
M nodes, which matches the number of target classes. 

This procedure repeats for a sufficiently large number of training cycles. In fact, one 
of the disadvantages of feedforward networks is that they need many training instances 
and also many iterations. The network weights usually stabilize in a local minimum of 
the error function, in which case we say that the network learned the target function. 
Given a sufficient number of hidden layers, the network is capable of learning linearly 
inseparable functions. 

To describe the training procedure more formally, we start with loading the feature 
values $x_{il}$ into the input units for classifying a test object $x_i$. The activation of these 
units is then propagated forward through the network in subsequent layers. The 
activation function is usually nonlinear. In the final output layer, the units determine 
the categorization decision. 

We study the squared error of the output node j for an input i: 


$$E_{ij} = \frac{1}{2} (y_{ij} - f_j(x_i))^2, \tag{6.8}$$


where $f_j(x_i)$ represents the response of the node j, and $y_{ij}$ is the corresponding 
target value. 

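Let us consider a network with only one hidden layer of K nodes and with M output 
nodes. Let us split the weights $w_{ij}$ into two subsets, $\alpha_{lk}$, $0 \leq l \leq d$, $1 \leq k \leq K$, and 
$\beta_{kj}$, $0 \leq k \leq K$, $1 \leq j \leq M$. The weights $\alpha_{lk}$ are on the edges leading from the input 
layer to the hidden layer, and the weights $\beta_{kj}$ are on the edges leading from the hidden 
layer to the output layer. The index 0 refers to the bias terms, and the vectors $\alpha_k$ and 
$\beta_j$ collect the remaining weights leading to the hidden node k and to the output node j. 
Then, we can write the output of a node in the hidden layer for an input vector $x_i$ as 


$$z_k = g(\alpha_{0k} + \alpha_k^\top x_i), \tag{6.9}$$

and the final output as 

$$f_j(x_i) = g(\beta_{0j} + \beta_j^\top z), \tag{6.10}$$


where z is a vector of the $z_k$ entries, and g denotes the activation function, a nonlinear, 
differentiable function. A common choice is the logistic function: 


$$g(x) = \frac{1}{1 + e^{-x}}. \tag{6.11}$$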


With this choice of activation function, a network without a hidden layer is 
identical to the logistic regression model, which is widely used in statistical modeling. 
This function is also known as the sigmoid function (see the kernel types in 
Section 7.4). It has a continuous derivative, and it is also preferred because its 
derivative is easily calculated: 


$$g'(x) = g(x)(1 - g(x)). \tag{6.12}$$

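
The identity in Equation 6.12 follows in two lines from the definition of the logistic function. The short derivation below is added for convenience and uses only the symbols already introduced.

```latex
\begin{align*}
g(x) &= \left(1 + e^{-x}\right)^{-1}, \\
g'(x) &= \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}}
       = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
       = g(x)\left(1 - g(x)\right),
\end{align*}
% the last step uses  1 - g(x) = e^{-x} / (1 + e^{-x}).
```

Since the derivative is expressed through the function value itself, the backward pass can reuse the activations already computed in the forward pass.
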
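We calculate the partial derivative of the error function with respect to the weights: 


$$\frac{\partial E_{ij}}{\partial \beta_{kj}} = -(y_{ij} - f_j(x_i))\, g'(\beta_{0j} + \beta_j^\top z_i)\, z_{ki}, \tag{6.13}$$

and, with $E_i = \sum_{m=1}^{M} E_{im}$ denoting the total error for the input i, 

$$\frac{\partial E_i}{\partial \alpha_{lk}} = -\sum_{m=1}^{M} (y_{im} - f_m(x_i))\, g'(\beta_{0m} + \beta_m^\top z_i)\, \beta_{km}\, g'(\alpha_{0k} + \alpha_k^\top x_i)\, x_{il}, \tag{6.14}$$

where $z_i$ is the vector of hidden-layer outputs for the input $x_i$, with entries $z_{ki}$, and 
$x_{il}$ is the l-th feature of $x_i$ (with $z_{0i} = x_{i0} = 1$ for the bias terms). 

Only the absolute value of the gradients matters. With a learning rate $\gamma$ that 
decreases over the iterations, we have the update formulas: 


$$\beta_{kj} \leftarrow \beta_{kj} - \gamma \sum_{i=1}^{N} \frac{\partial E_{ij}}{\partial \beta_{kj}}, \tag{6.15}$$

$$\alpha_{lk} \leftarrow \alpha_{lk} - \gamma \sum_{i=1}^{N} \frac{\partial E_i}{\partial \alpha_{lk}}. \tag{6.16}$$
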

Just as in the delta rule of the perceptron, in a learning iteration, we calculate 
the error term of the current weights: this is the forward pass in which the partial 
derivatives gain their value. In the backward pass, these errors are propagated to the 
weights to compute the updates of the weights, hence the name back-propagation. 
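
The forward and backward passes for a network with one hidden layer can be sketched with NumPy. This is a minimal illustration rather than a tuned implementation, and it makes several assumptions: the activation is the logistic function in both layers, the error is the squared error summed over all instances and output nodes, the learning rate is kept constant instead of decreasing, and the XOR data, the three hidden nodes, and all function names are chosen only for the example.

```python
import numpy as np

def g(x):
    """Logistic activation."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(X, alpha0, alpha, beta0, beta):
    Z = g(alpha0 + X @ alpha)          # hidden-layer outputs, shape (N, K)
    F = g(beta0 + Z @ beta)            # network outputs, shape (N, M)
    return Z, F

def loss(X, Y, params):
    _, F = forward(X, *params)
    return 0.5 * np.sum((Y - F) ** 2)

def gradients(X, Y, alpha0, alpha, beta0, beta):
    # Forward pass: the activations determine the partial derivatives.
    Z, F = forward(X, alpha0, alpha, beta0, beta)
    # Backward pass: output errors are propagated to the hidden layer.
    delta_out = -(Y - F) * F * (1.0 - F)                 # uses g' = g (1 - g)
    delta_hid = (delta_out @ beta.T) * Z * (1.0 - Z)
    return (delta_hid.sum(axis=0), X.T @ delta_hid,
            delta_out.sum(axis=0), Z.T @ delta_out)

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)          # XOR targets

# Random initial weights: d = 2 inputs, K = 3 hidden nodes, M = 1 output.
params = [rng.normal(size=3), rng.normal(size=(2, 3)),
          rng.normal(size=1), rng.normal(size=(3, 1))]
initial_loss = loss(X, Y, params)

gamma = 0.2                                              # constant learning rate
for _ in range(2000):
    grads = gradients(X, Y, *params)
    params = [p - gamma * dp for p, dp in zip(params, grads)]
final_loss = loss(X, Y, params)
```

The error after training, `final_loss`, is lower than `initial_loss`. Whether the network reaches a perfect XOR classifier depends on the random initialization, since the gradient descent may stop in a local minimum.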

As the objective function is based on the error terms in Equation 6.8, the surface on 
which we perform the gradient descent is nonconvex in general. Hence, it is possible 
that the procedure only reaches a local minimum; there are no guarantees that a global 
optimum will be reached. 

The weights in the network are difficult to interpret, although in some cases rules 
can be extracted on the basis of their absolute value and sequence (Lu et al., 1996). 
Quantum neural networks replace weights with entanglement (Section 11.3). 

The storage capacity for a network with a single hidden layer is still linear in the 
number of neurons. The coefficient, however, is better than in a Hopfield network, as N 
patterns need only N nodes in the hidden layer (Huang and Babri, 1998). A two-layer 
network can store $O(K^2/m)$ patterns, where m is the number of output nodes (Huang, 
2003). 

Back-propagation is a slow procedure; hence, developing faster training mecha- 
nisms is of great importance. One successful example is extreme learning machines, 
which perform a stochastic sampling of hidden nodes in the training iterations (Huang 
et al., 2006). Speed is also an advantage of quantum neural networks, with the 
added benefit that they may also find a global optimum with the nonconvex objective 
function. 


6.4 Deep Learning 


A long training time is a major issue that prevented the development of neural 
networks with several layers of hidden nodes for a long time. Deep learning overcomes 
this problem by computational tricks, including learning by layers and exploiting 
massively parallel architectures. 

In deep learning, the representation of information is distributed: the layers in the 
network correspond to different levels of abstraction. Increasing the number of layers 
and the number of nodes in the layers leads to different levels of abstraction. This 
contrasts with shallow architectures, such as support vector machines, in which a 
single input layer is followed by a single layer of trainable coefficients. Deep learning 
architectures have the potential to generalize better with less human intervention, 
automatically building the layers of abstraction that are necessary for complex 
learning tasks (Bengio and LeCun, 2007). 

The layers constitute a hierarchy: different concepts are learned from other 
concepts in a lower level of abstraction. Hence, toward the end of the learning pipeline, 
high-level concepts emerge. The associated loss function is often nonconvex (Bengio 
and LeCun, 2007). 

Deep learning often combines unsupervised and supervised learning. Thus, deep 
learning algorithms make use of both unlabeled and labeled data instances, not unlike 
semisupervised learning. Unsupervised learning helps the training process by acting as 
a regularizer and aid to optimization (Erhan et al., 2010). Deep learning architectures 
are prone to overfitting because the additional layers allow the modeling of rare or 
even ad hoc dependencies in the training data. Attaching unsupervised learners to 
preprocessing or postprocessing which is unrelated to the main model helps overcome 
this problem (Hinton et al., 2012). 


6.5 Computational Complexity 


Given the diversity of neural networks, we cannot possibly state a general com- 
putational complexity for these learning algorithms. The complexity is topology- 
dependent, and it also depends on the training algorithm, and the nature of the neurons. 

A Hopfield network's computational complexity will depend on the maximum absolute 
value of the weights, irrespective of whether the update function is synchronous 
or asynchronous (Orponen, 1994). The overall complexity is $O(N^2 \max_{i,j} |w_{ij}|)$. 

In a feedforward network with N training instances and a total of w weights, 
each epoch requires O(Nw) time. The number of epochs can be exponential in d, 
the number of input nodes (Han et al., 2012, p. 404). 

If we look at computational time rather than time complexity, parallel computing 
can greatly decrease the amount of time that back-propagation takes to converge. To 
improve parallel efficiency, we must use a slight simplification: the update of weights 
in Equations 6.15 and 6.16 introduces dependencies that prevent parallelism. If we 
batch the updates to calculate the error over many training instances before the update 
is done, we remove this bottleneck. 

In this batched algorithm, the training data are broken up into equally large parts for 
each thread in the parallel architecture. The threads perform the forward and backward 
passes, and corresponding weights are summed locally for each thread. At regular 
training intervals, the updates are shared between the threads to refresh the weights. 
The communication involved is limited to sharing the updates. 
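
The batched scheme can be sketched in a few lines. The example below is only an illustration under stated assumptions, not the implementation discussed in the text: it uses a single linear neuron with squared error, it simulates the threads sequentially, and the helper names `shard_gradient` and `batched_update` are hypothetical. It shows that summing the per-shard gradients reproduces the full-batch gradient, so only these sums need to be communicated.

```python
def shard_gradient(w, shard):
    """Sum of squared-error gradients of a linear neuron over one shard."""
    grad = [0.0] * len(w)
    for x, y in shard:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for k, xk in enumerate(x):
            grad[k] += err * xk
    return grad


def batched_update(w, data, n_threads, lr):
    """One batched step: each simulated thread sums gradients over its shard."""
    size = (len(data) + n_threads - 1) // n_threads
    shards = [data[i:i + size] for i in range(0, len(data), size)]
    partial = [shard_gradient(w, s) for s in shards]  # local sums per thread
    total = [sum(g[k] for g in partial) for k in range(len(w))]  # shared update
    return [wi - lr * gk / len(data) for wi, gk in zip(w, total)]


# Toy data generated by y = x1 + 2 * x2 (an assumption for illustration).
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0),
        ([1.0, 1.0], 3.0), ([2.0, 1.0], 4.0)]
w = [0.0, 0.0]
for _ in range(2000):
    w = batched_update(w, data, n_threads=2, lr=0.1)
```

Because the shards only exchange summed gradients, the result does not depend on the number of threads, apart from floating-point rounding.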

Simulated annealing can help speed up training, and with a sufficiently slow cooling
schedule it also converges to the global optimum.

Gradient descent and back-propagation extend to training a deep learning network. 
As we noted earlier, their computational cost is high, but they are easy to implement 
and they are guaranteed to converge to a local minimum. The computational cost fur- 
ther increases with the sheer number of free parameters in a deep learning model: the 
number of abstraction layers, the number of nodes in the various layers, learning rates, 
initialization of weight vectors—these have to be tuned over subsequent independent 
training rounds. Massively parallel computer architectures enable the efficient training 
of these vast networks. There is little data dependency between individual neurons; 
hence, graphics processing units and similar massively parallel vector processors are 
able to accelerate the calculations, often reducing the computational time by a factor 
of 10-20 (Raina et al., 2009). Computations further scale out in a distributed network 
of computing nodes, enabling the training of even larger deep learning models. 


7 Supervised Learning and Support Vector Machines


The primary focus in this chapter is on support vector machines, the main learning
algorithm that derives from the Vapnik-Chervonenkis (VC) theory. Before discussing
various facets of support vector machines, however, we take a brief detour to another,
simpler supervised learner, the K-nearest neighbors algorithm (Section 7.1). This
algorithm has an efficient quantum variant, and it also applies to a form of regression
(Chapter 8), which in turn has an entirely different quantum formulation.

The development of support vector machines started with optimal margin hyper- 
planes that separate two classes with the highest expected generalization performance 
(Section 7.2). Soft margins allow noisy training instances in cases where the two
classes are not separable (Section 7.3).

With the use of kernel functions, nonlinearity is addressed, allowing the
embedding of data into a higher-dimensional space, where they become linearly 
separable, but still subject to soft margins (Section 7.4). “Feature space” in the 
machine learning literature refers to the high-dimensional space describing the 
characteristics of individual data instances (Section 2.2). In the context of support 
vector machines, however, this space is called input space. The reason is that the 
input space is often mapped to higher-dimensional space to tackle nonlinearities in 
the data, and the embedding space is called feature space. This is further explained in 
Section 7.4. 

One quantum variant relies on a slightly altered formulation of support vector 
machines that uses least-squares optimization, which we discuss in Section 7.5. 

Support vector machines achieve an outstanding generalization performance 
(Section 7.6), although the extension to multiclass problems is tedious (Section 7.7). 

We normally choose loss functions to have a convex objective function, primarily 
for computational reasons (Section 7.8). A quantum formulation may liberate support 
vector machines from this constraint. Computational complexity can be estimated 
only in certain cases, but it is at least polynomial both in the number of training 
instances and in the number of dimensions (Section 7.9). Not surprisingly, the 
quantum variant will improve this. 


7.1 K-Nearest Neighbors


In the K-nearest neighbors algorithm, calculation only takes place when a new,
unlabeled instance is presented to the learner. Seeing the new instance, the learner 
searches for the K most nearby data instances—this is the computationally expensive 
step that is easily accelerated by quantum methods (Section 12.1). Among those K 
instances, the algorithm will select the class which is the most frequent. 

The construction of a K-nearest neighbor classifier involves determining a thresh- 
old K indicating how many top-ranked training objects have to be considered. Larkey 
and Croft (1996) used K = 20, while others found 30 < K < 45 to be the most 
effective (Joachims, 1998; Yang and Chute, 1994; Yang and Liu, 1999). 

The K-nearest neighbors algorithm does not build an explicit, declarative
representation of the category $c_k$, but it has to rely on the category labels attached
to the training objects similar to the test objects. The K-nearest neighbors algorithm
makes a prediction based on the training patterns that are closest to the unlabeled
example. For deciding whether $x_i \in c_k$, the K-nearest neighbors algorithm looks at
whether the K training objects most similar to $x_i$ also are in $c_k$; if the answer is
positive for a large enough proportion of them, a positive decision is taken (Yang and
Chute, 1994). This instance-based learning is a form of transduction.
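
The procedure can be illustrated with a short sketch. It assumes Euclidean distance and a simple majority vote among the K nearest training instances; the function name `knn_classify` and the toy data are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """Return the most frequent label among the k training points nearest to query.

    train is a list of (vector, label) pairs; distances are Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
         ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.1, 0.9), "b")]
label = knn_classify(train, (0.15, 0.1), k=3)
```

All the work happens at query time: the sort over the training set is the expensive neighbor search mentioned above.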


7.2 Optimal Margin Classifiers 


A support vector machine is a supervised learning algorithm which learns from a given
independent and identically distributed set of training instances
$\{(x_1, y_1), \ldots, (x_N, y_N)\}$, where $y_i \in \{-1, 1\}$ are the binary classes to
which the data points belong.

A hyperplane in $\mathbb{R}^d$ has the generic form


$w^\top x - b = 0$, (7.1)


where w is the normal vector to the hyperplane, and the bias parameter b helps 
determine the offset of the hyperplane from the origin. 

We assume that the data instances are linearly separable—that is, there exists a 
hyperplane that completely separates the data instances belonging to the two classes. 
In this case, we look for two hyperplanes such that there are no points in between and 
we maximize their distance. The area between the hyperplanes is the margin. 

In its simplest, linear form, a support vector machine is a hyperplane that separates
a set of positive examples from a set of negative examples with maximum margin. The
distance between the two planes is $2/\|w\|$; hence, minimizing $\|w\|$ will lead to a
maximal margin, which in turn leads to good generalization performance. The formula for
the output of a linear support vector machine is


$\hat{y}_i = \operatorname{sign}(w^\top x_i - b)$, (7.2)


where $x_i$ is the ith training example. With this, the conditions for data instances
not to fall into the margin are as follows:

$w^\top x_i - b \ge 1$ for $y_i = 1$,
$w^\top x_i - b \le -1$ for $y_i = -1$. (7.3)

These conditions can be written briefly as

$y_i(w^\top x_i - b) \ge 1, \quad i = 1, \ldots, N$. (7.4)


The optimization is subject to these constraints, and it seeks the optimal decision 
hyperplane with 


$\arg\min_{w,b} \frac{1}{2}\|w\|^2$. (7.5)

The margin is also equal to the distance of the decision hyperplane to the nearest 
of the positive and negative examples. Support vectors are the training data that lie on 
the margin (Figure 7.1). 

While this primal formulation of the linear case has a linear time complexity 
solution (Joachims, 2006), we more often refer to the dual formulation of the problem. 
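To obtain the dual formulation, first we introduce Lagrange multipliers $\alpha_i$ to
include the constraints in the objective function:


$\arg\min_{w,b} \max_{\alpha_i \ge 0} \left\{ \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i(w^\top x_i - b) - 1 \right] \right\}$. (7.6)
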
The $\alpha_i$ corresponding to nonsupport vectors will be set to zero, as they do not
make a difference in finding the saddle point of the expanded objective function. We
denote the objective function in Equation 7.6 by $L(w, b, \alpha)$. Setting the
derivatives with respect to w and b to zero, we get


$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{N} \alpha_i y_i x_i = 0$,
$\frac{\partial L}{\partial b} = \sum_{i=1}^{N} \alpha_i y_i = 0$. (7.7)


From this, we find that the solution is a linear combination of the training vectors:


$w = \sum_{i=1}^{N} \alpha_i y_i x_i$. (7.8)


Figure 7.1 An optimal margin classifier maximizes the separation between two classes,
where the separation is measured as the distance between the margins. Support vectors
are the data instances that lie on the margins.


Inserting this into the objective function in Equation 7.6, we are able to express the
optimization problem solely in terms of $\alpha_i$. We define the dual problem with the
$\alpha_i$ multipliers,


$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j x_i^\top x_j$ (7.9)

subject to (for any i = 1,..., N)


$\alpha_i \ge 0$, (7.10)


and from the minimization in b, the additional constraint is


$\sum_{i=1}^{N} \alpha_i y_i = 0$. (7.11)


Eventually, only a few $\alpha_i$ will be nonzero. The corresponding $x_i$ are the
support vectors that lie on the margin. Thus, for the support vectors,
$y_i(w^\top x_i - b) = 1$. Rearranging the equation, we get the bias term
$b = w^\top x_i - y_i$. We average it over all support vectors to get a more robust
estimate:


$b = \frac{1}{|\{i : \alpha_i \neq 0\}|} \sum_{i : \alpha_i \neq 0} (w^\top x_i - y_i)$. (7.12)
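
As a small worked illustration of Equations 7.8 and 7.12, the sketch below recovers w and b from a given set of dual multipliers. The three training points and the multipliers are assumptions chosen so that the optimum can be checked by hand (the first two points are the support vectors); the helper name `recover_primal` is hypothetical, and no quadratic programming solver is involved.

```python
def recover_primal(alphas, xs, ys, tol=1e-12):
    """Compute w as in Equation 7.8 and b as in Equation 7.12 from the multipliers."""
    dim = len(xs[0])
    w = [sum(a * y * x[k] for a, x, y in zip(alphas, xs, ys)) for k in range(dim)]
    support = [i for i, a in enumerate(alphas) if abs(a) > tol]
    # Average b = w^T x_i - y_i over the support vectors only.
    b = sum(sum(wk * xk for wk, xk in zip(w, xs[i])) - ys[i]
            for i in support) / len(support)
    return w, b


xs = [(1.0, 0.0), (-1.0, 0.0), (3.0, 1.0)]
ys = [1, -1, 1]
alphas = [0.5, 0.5, 0.0]  # assumed optimal multipliers for this toy problem
w, b = recover_primal(alphas, xs, ys)
margins = [y * (sum(wk * xk for wk, xk in zip(w, x)) - b) for x, y in zip(xs, ys)]
```

The third point has a zero multiplier and a margin larger than one, so removing it would not change the classifier.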


7.3 Soft Margins 


Not all classification problems are linearly separable. A few instances of one class 
may mingle with elements of the other classes. The hard-margin classifier introduced 
in Section 7.2 is not able to efficiently learn a model from such data. Even if the 
classes are separable, the dependence on the margin makes a hard-margin support 
vector machine sensitive to just a few points that lie near the boundary. 

To deal with such cases, nonnegative slack variables help. These measure the
degree to which a data instance deviates from the margin. The optimization becomes 
a tradeoff between maximizing the margin and controlling the slack variables. A 
parameter C balances the tradeoff—it is also called a penalty or cost parameter. 

The condition for soft-margin classification with the slack variables becomes 


$w^\top x_i - b \ge 1 - \xi_i$ if $y_i = +1$, (7.13)
$w^\top x_i - b \le -1 + \xi_i$ if $y_i = -1$. (7.14)


The primal form of the optimization is to minimize $\|w\|$ and the amount of deviation
described by the slack variables, subject to the above constraints:


$\min \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$. (7.15)


The corresponding Lagrangian, analogous to Equation 7.6, is


$\arg\min_{w,\xi,b} \max_{\alpha,\beta} \left\{ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left[ y_i(w^\top x_i - b) - 1 + \xi_i \right] - \sum_{i=1}^{N} \beta_i \xi_i \right\}$ (7.16)


with $\alpha_i, \beta_i \ge 0$. Taking the partial derivatives, we derive relations
similar to Equation 7.7, with the additional equality $C - \beta_i - \alpha_i = 0$, which
together with $\beta_i \ge 0$ implies that $\alpha_i \le C$.


Thus, the dual formulation is similar to the hard-margin case:


$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j x_i^\top x_j$ (7.17)

subject to


$0 \le \alpha_i \le C, \quad i = 1, \ldots, N$,
$\sum_{i=1}^{N} \alpha_i y_i = 0$. (7.18)


Nonzero slack variables occur when an $\alpha_i$ reaches its upper limit. The cost
parameter C acts as a form of regularization: if the cost of misclassification is higher,
a more accurate model is sought with increased complexity, that is, with a higher
number of support vectors.


7.4 Nonlinearity and Kernel Functions 


Maximum margin classifiers with soft margins were already an important step 
forward in machine learning, as they have an exceptional generalization performance 
(Section 7.6). What makes support vector machines even more powerful is that they 
are not restricted to linear decision surfaces. 

The dual formulation enables a nonlinear kernel mapping that maps the data
instances from an input space into a higher-dimensional (possibly even
infinite-dimensional) embedding space, which is called feature space in this context.
The core idea is to replace the inner product $x_i^\top x_j$ in the objective function in
Equation 7.17 by a function that retains many properties of the inner product, yet which
is nonlinear. This function is called a kernel.

We embed the data instances with an embedding function $\phi$ that classifies points as


$y_i(w^\top \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N$. (7.19)


Within these constraints, we look for the solution of the following optimization
problem:


$\min \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$. (7.20)

The objective is identical to the soft-margin case (Equation 7.15).

As in the previous sections, we introduce Lagrangian multipliers to accommodate
the constraints in the minimization problem. The partial derivatives in w, b, and $\xi$
define a saddle point of the Lagrangian, with which the dual formulation becomes the
following quadratic programming problem:


$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$ (7.21)

subject to


$0 \le \alpha_i \le C, \quad i = 1, \ldots, N$,
$\sum_{i=1}^{N} \alpha_i y_i = 0$. (7.22)


The function $K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$ is the kernel function, the dot
product of the embedding space.

In the dual formulation, the kernel function bypasses calculation of the embedding
itself. We use the kernel function directly to find the optimum and subsequently
to classify new instances, so we do not need to define an actual embedding. This is
exploited by many kernels. For the same reason, the embedding space can be of infinite
dimensions. Bypassing the $\phi$ embedding is often called the kernel trick.

Technically, any continuous symmetric function $K(x_i, x_j) \in L_2 \otimes L_2$ may be
used as an admissible kernel, as long as it satisfies a weak form of Mercer’s condition
(Smola et al., 1998):


$\int\!\!\int K(x_i, x_j)\, g(x_i)\, g(x_j)\, dx_i\, dx_j \ge 0$ for all $g \in L_2(\mathbb{R}^d)$. (7.23)


That is, all positive semidefinite kernels are admissible. This condition ensures that 
the kernel “behaves” like an inner product. 

We do not need the $\phi$ embedding to classify new data instances either. For an
arbitrary new data point x, the decision function for a binary decision problem
becomes


$f(x) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b \right)$. (7.24)

This formula reveals one more insight: the learned model is essentially instance- 
based. The support vectors are explicitly present through the kernel function. Since 
the parameter $\alpha_i$ is controlling the instances, and there is an additional bias term b,
it is not a pure form of transduction, but it is closely related. Since classification is 
based on a few examples in a single step, support vector machines are considered 
shallow learners, as opposed to the multilayer architecture of deep learning algorithms 
(Section 6.4). 

Some important kernels are listed in Table 7.1. The linear kernel does not have an 
embedding; this case is identical to the one described in Section 7.3. 

The polynomial kernel has a bias parameter c and a degree parameter d. It is a good 
example to illustrate the kernel trick. For d = 2, the explicit embedding of the input 
space into the feature space is given by 


$\phi(x) = (x_1^2, \ldots, x_d^2, \sqrt{2}x_1x_2, \ldots, \sqrt{2}x_1x_d, \sqrt{2}x_2x_3, \ldots, \sqrt{2}x_{d-1}x_d, \sqrt{2c}\,x_1, \ldots, \sqrt{2c}\,x_d, c)$. (7.25)


The kernel trick bypasses this embedding into a
$\binom{d+2}{2} = \frac{d^2+3d+2}{2}$-dimensional space, and allows the use of just d + 1
multiplications.

The radial basis function kernel has no explicitly defined embedding, and it
operates in an infinite-dimensional space. The parameter $\gamma$ controls the radius of
the basis function. If no prior knowledge is available to adjust its value,
$\gamma = 1/(2\sigma^2)$ is a reasonable default value, where $\sigma$ is the standard
deviation of the data.

This approach of infinite-dimensional feature spaces gave rise to combining
wavelets with support vector machines, proving that these kernels satisfy the Mercer
condition while also addressing niche classification problems (Wittek and Tan, 2011;
Zhang et al., 2004).

A sigmoid kernel is also frequently cited ($\tanh(\gamma x_i^\top x_j + c)$, where
$\gamma$ and c are parameters), but this kernel is not positive definite for all choices
of the parameters, and for the valid range of parameters, it was shown to be equivalent
to the radial basis function kernel (Lin and Lin, 2003). A support vector machine with a
sigmoid kernel is also equivalent to a two-layer neural network with no hidden layers,
that is, a perceptron (Section 6.1).

Table 7.1 Common Kernels


Kernel                   $K(x_i, x_j)$
Linear                   $x_i^\top x_j$
Polynomial               $(x_i^\top x_j + c)^d$
Radial basis function    $\exp(-\gamma \|x_i - x_j\|^2)$

The objective function of the dual formulation in Equation 7.21 depends on the data
points via the kernel function. If we evaluate the kernel function on all possible data
pairs, we can define a matrix with elements $k_{ij} = K(x_i, x_j)$, $i, j = 1, \ldots, N$.
We call this the kernel or Gram matrix. Calculating it efficiently is key to deriving
efficient algorithms, as it encodes all information about the data that a kernel learning
model can obtain.

The Gram matrix is symmetric by definition. As the kernel function must be
positive semidefinite, so must the Gram matrix. Naturally, the Gram matrix is invariant
to rotations of the data points, as inner products are always rotation-invariant. 

Let $K_1$ and $K_2$ be kernels over $X \times X$, $X \subseteq \mathbb{R}^d$,
$a \in \mathbb{R}^+$, let f be a real-valued function on X and in $L_2(X)$, let $K_3$ be
a kernel over $Y \times Y$, $Y \subseteq \mathbb{R}^{d'}$, $\theta : X \to Y$, and let B
be a symmetric positive semidefinite $d \times d$ matrix. Furthermore, let p(x) be a
polynomial with positive coefficients over $\mathbb{R}$. For all $x, z \in X$,
$x', z' \in Y$, the following functions are also kernels (Cristianini and Shawe-Taylor,
2000):


1. $K(x, z) := K_1(x, z) + K_2(x, z)$.
2. $K(x, z) := aK_1(x, z)$.
3. $K(x, z) := K_1(x, z)K_2(x, z)$.
4. $K(x, z) := f(x)f(z)$.
5. $K(x, z) := K_3(\theta(x), \theta(z))$.
6. $K(x, z) := x^\top Bz$.
7. $K(x, z) := p(K_1(x, z))$.
8. $K(x, z) := \exp(K_1(x, z))$.


These properties show that we can easily create new kernels from existing ones using 
simple operations. 

The properties apply to kernel functions, but we can apply further transformations 
on the Gram matrix itself, as long as it remains positive semidefinite. We center the 
data—that is, move the origin of the feature space to the center of mass of the data 
instances. The sum of the norms of the data instances in the feature space is the trace 
of the matrix, which is, in turn, the sum of its eigenvalues. Centering thus minimizes 
the sum of eigenvalues. 
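
Centering can be carried out on the Gram matrix alone, without access to the embedding. The sketch below assumes the standard centering formula $K_c = K - \mathbf{1}K/N - K\mathbf{1}/N + \mathbf{1}K\mathbf{1}/N^2$, where $\mathbf{1}$ is the all-ones matrix, and uses a radial basis function kernel on a few made-up points; the function names are hypothetical.

```python
import math


def rbf_gram(points, gamma):
    """Gram matrix of the radial basis function kernel."""
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(p, q)))
             for q in points] for p in points]


def center_gram(K):
    """Move the feature-space origin to the center of mass, using only K.

    Relies on K being symmetric, so row means equal column means.
    """
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - row_mean[j] + total_mean
             for j in range(n)] for i in range(n)]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
K = rbf_gram(points, gamma=0.5)
Kc = center_gram(K)
trace_before = sum(K[i][i] for i in range(len(K)))
trace_after = sum(Kc[i][i] for i in range(len(Kc)))
```

After centering, every row of the matrix sums to zero and the trace, and with it the sum of the eigenvalues, is smaller than before.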

While the Gram matrix is likely to be a full-rank matrix, a subspace projection
can be beneficial through a low-rank approximation. As in methods based on rank 
reduction through singular value decomposition, low-rank approximations act as a 
form of denoising the data. 


7.5 Least-Squares Formulation 


Least-squares support vector machines modify the goal function of the primal problem
by using the $\ell_2$ norm in the regularization term (Suykens and Vandewalle, 1999):


$\min_{w,b,e} \frac{1}{2} w^\top w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^2$ (7.26)

subject to the equality constraints

$y_i(w^\top \phi(x_i) + b) = 1 - e_i, \quad i = 1, \ldots, N$. (7.27)


The parameter $\gamma$ plays the same role as the cost parameter C. Seeking the saddle
point of the corresponding Lagrangian, we obtain the following least-squares problem:


$\begin{pmatrix} 0 & \mathbf{1}^\top \\ \mathbf{1} & K + \gamma^{-1} I \end{pmatrix} \begin{pmatrix} b \\ \alpha \end{pmatrix} = \begin{pmatrix} 0 \\ y \end{pmatrix}$, (7.28)


where K is the kernel matrix, and $\mathbf{1}$ is a vector of 1’s. The least-squares
support vector machine trades off zero $\alpha_i$ for nonzero error terms $e_i$, leading
to increased model complexity.
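
Since training reduces to one linear system, a least-squares support vector machine can be sketched without a quadratic programming solver. The example below assumes the block system with rows $(0, \mathbf{1}^\top)$ and $(\mathbf{1}, K + \gamma^{-1}I)$ and right-hand side $(0, y)$, and it assumes that the resulting multipliers already carry the sign of the labels, so that new points are classified by the sign of $\sum_i \alpha_i K(x_i, x) + b$. The helper names and the toy data are hypothetical.

```python
def solve_linear_system(A, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (M[r][n] - s) / M[r][r]
    return sol


def train_ls_svm(xs, ys, kernel, gamma):
    """Build and solve the (N + 1) x (N + 1) system; returns (b, alphas)."""
    n = len(xs)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        A.append([1.0] + [kernel(xs[i], xs[j]) + (1.0 / gamma if i == j else 0.0)
                          for j in range(n)])
    sol = solve_linear_system(A, [0.0] + [float(y) for y in ys])
    return sol[0], sol[1:]


def predict(x, xs, b, alphas, kernel):
    score = sum(a * kernel(xi, x) for a, xi in zip(alphas, xs)) + b
    return 1 if score >= 0 else -1


def linear_kernel(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


xs = [(-2.0,), (-1.0,), (1.0,), (2.0,)]
ys = [-1, -1, 1, 1]
b, alphas = train_ls_svm(xs, ys, linear_kernel, gamma=10.0)
predictions = [predict(x, xs, b, alphas, linear_kernel) for x in xs]
```

In this toy problem every multiplier is nonzero, which illustrates the loss of sparsity mentioned above.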


7.6 Generalization Performance 


To guarantee good generalization performance, we expect a low VC dimension to give 
a tight bound on the expected error (Section 2.5). Oddly, support vector machines may 
have a high or even infinite VC dimension. Yet, we are able to establish limits on the 
generalization performance. 

The VC dimension of the learner depends on the Hilbert space H to which the 
nonlinear embedding maps. If the penalty parameter C is allowed to take all values, 
the VC dimension of the support vector machine is dim(H) + 1 (Burges, 1998). For
instance, for a degree-two polynomial kernel, it is $\frac{d^2+3d+2}{2} + 1$, and for a
radial basis function kernel, this dimension is infinite.

Despite this, we are able to set a bound on the generalization error (Shawe-Taylor
and Cristianini, 2004). With probability $1 - \delta$, the bound is

$\frac{1}{CN} - \frac{\sqrt{-W(\alpha^*)}}{CN\gamma^*} + \frac{4}{N\gamma^*}\sqrt{\operatorname{tr}(K)} + 3\sqrt{\frac{\ln(2/\delta)}{2N}}$, (7.29)

where $\alpha^*$ is the solution of the dual problem of the soft-margin formulation,
$W(\alpha^*) = -\frac{1}{2}\sum_{i,j} \alpha_i^* \alpha_j^* y_i y_j x_i^\top x_j$, and
$\gamma^* = \left(\sum_i \alpha_i^*\right)^{-1/2}$. In accordance with the general
principles of structural risk minimization, sparsity will improve this bound.


7.7 Multiclass Problems 


The formulations of support vector machines considered so far were originally
designed for binary classification. There are two basic approaches for multiclass
classification problems. One involves constructing and combining several binary
classifiers, and the other involves directly considering all data in one optimization
formulation. The latter approach, which solves the multiclass support vector machine
problem in one step, has a number of variables proportional to the number of classes.
Thus, for multiclass support vector machine methods, either several binary classifiers
have to be constructed or a larger optimization problem is needed. In general, it is
computationally more expensive to solve a multiclass problem than a binary problem
with the same amount of data (Hsu and Lin, 2002).

The earliest implementation used for multiclass classification was the one-against-all
method. It constructs M models, where M is the number of classes. The ith
support vector machine is trained with all of the examples in the ith class with
positive labels, and all other examples with negative labels. Thus, given training data
$\{(x_1, y_1), \ldots, (x_N, y_N)\}$, the ith model solves the following problem:

$\min_{w^i, b^i, \xi^i} \frac{1}{2}(w^i)^\top w^i + C \sum_{j=1}^{N} \xi_j^i$, (7.30)
$(w^i)^\top \phi(x_j) + b^i \ge 1 - \xi_j^i$ if $y_j = i$, (7.31)
$(w^i)^\top \phi(x_j) + b^i \le -1 + \xi_j^i$ if $y_j \ne i$, (7.32)
$\xi_j^i \ge 0, \quad j = 1, \ldots, N$. (7.33)

After the problem has been solved, there are M decision functions:

$(w^1)^\top \phi(x) + b^1$, (7.34)
$\vdots$ (7.35)
$(w^M)^\top \phi(x) + b^M$. (7.36)


An x is in the class which has the largest value of the decision function:


class of $x = \arg\max_{i=1,\ldots,M} \left( (w^i)^\top \phi(x) + b^i \right)$. (7.37)


Addressing the dual formulation results in M N-variable quadratic problems having to
be solved.
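
The decision rule of the one-against-all method is a plain arg max over the M decision functions. The sketch below assumes that the M binary models have already been trained and, for simplicity, that the embedding is the identity, so each model is just a pair (w, b); the weights shown are made up for illustration.

```python
def one_against_all_predict(x, models):
    """Return the index of the class whose decision function is largest.

    models is a list of (w, b) pairs, one per class, assumed to be trained already.
    """
    scores = [sum(wk * xk for wk, xk in zip(w, x)) + b for w, b in models]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical models for M = 3 classes in two dimensions.
models = [((1.0, 0.0), 0.0), ((-1.0, 0.0), 0.0), ((0.0, 1.0), -0.5)]
predicted = one_against_all_predict((0.2, 2.0), models)
```

Note that the scores of independently trained models are compared directly, which is one reason why the binary models are usually trained with the same kernel and cost parameter.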

Another major method is the one-against-one method. This method constructs
M(M − 1)/2 classifiers, where each one is trained on data from two classes. For
training data from the ith and the jth classes, the following binary classification
problem is solved:


$\min_{w^{ij}, b^{ij}, \xi^{ij}} \frac{1}{2}(w^{ij})^\top w^{ij} + C \sum_{t} \xi_t^{ij}$, (7.38)
$(w^{ij})^\top \phi(x_t) + b^{ij} \ge 1 - \xi_t^{ij}$ if $y_t = i$, (7.39)
$(w^{ij})^\top \phi(x_t) + b^{ij} \le -1 + \xi_t^{ij}$ if $y_t = j$, (7.40)
$\xi_t^{ij} \ge 0$. (7.41)


Thus, M(M − 1)/2 classifiers are constructed.

The one-against-all and the one-against-one methods were shown to be superior 
to the formulation to solve multiclass support vector machine problems in one step, 
because the latter approach tends to overfit the training data (Hsu and Lin, 2002). 


7.8 Loss Functions


Let us revisit the objective function in soft-margin classification in Equation 7.15. A
more generic form is given by


$\min \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} L(y_i, f(x_i))$, (7.42)


where $L(y_i, f(x_i))$ is a loss function. The optimal loss function is the 0-1 loss,
which has the value 0 if $y_i = f(x_i)$, and 1 otherwise. This is a nondifferentiable
function which also leads to a nonconvex objective function. The most commonly used
function instead is the hinge loss:


$L(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))$, (7.43)


where f(x;) is without the threshold—it is the raw output of the kernel expansion. 
The positive semidefinite nature of the kernel function implies the convexity of 
the optimization (Shawe-Taylor and Cristianini, 2004, p. 216). Positive semidefinite 
matrices form a cone, where a cone is a subset closed under addition and
multiplication by nonnegative scalars, which implies the convexity. 

The number of support vectors increases linearly with the number of training 
examples when using a convex loss function (Steinwart, 2003). Given this theoretical 
result, the lack of sparsity prevents the use of convex support vector machines in 
large-scale problems. Convex loss functions are also sensitive to noise, especially 
label noise, and outliers. These are the core motivating factors to consider nonconvex 
formulations. 

Depending on how the examples with an insufficient margin are penalized, it is easy 
to derive a nonconvex formulation, irrespective of the kernel used. When nonconvex 
loss functions are used, sparsity in the classifier improves dramatically (Collobert 
et al., 2006), making support vector machines far more practical for large data sets. 
The ramp loss function is the difference of two hinge losses, controlling the score 
window in which data instances become support vectors; this leads to improved 
sparsity (Ertekin et al., 2011). We can also directly approximate the optimal 0-1 loss 
function, instead of relying on surrogates like the hinge loss (Shalev-Shwartz et al., 
2010). If we can estimate the proportion of noise in the labels, we can also derive 
a nonconvex loss function, and solve the primal form directly with quasi-Newton 
minimization (Stempfel and Ralaivola, 2009). Nonconvex support vector machines 
are an example of the rekindled interest in nonconvex optimization (Bengio and 
LeCun, 2007). We discuss further regularized objective functions and loss functions 
in Section 9.4. 
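
The loss functions discussed in this section can be compared as functions of the margin $z = y_i f(x_i)$. The sketch below is an illustration only: the ramp loss is written as the difference of two hinge losses with offsets 1 and s, where the value s = −1 is an assumed parameter, not one prescribed by the text.

```python
def zero_one_loss(z):
    """0-1 loss on the margin z = y * f(x): an error whenever the sign is wrong."""
    return 0.0 if z > 0 else 1.0


def hinge_loss(z, offset=1.0):
    """Hinge loss max(0, offset - z); offset = 1 gives Equation 7.43."""
    return max(0.0, offset - z)


def ramp_loss(z, s=-1.0):
    """Difference of two hinge losses; bounded above by 1 - s (s is assumed)."""
    return hinge_loss(z, 1.0) - hinge_loss(z, s)


margins = [2.0, 0.5, -1.0, -3.0, -10.0]
table = [(z, zero_one_loss(z), hinge_loss(z), ramp_loss(z)) for z in margins]
```

The hinge loss grows without bound for badly misclassified points, whereas the ramp loss saturates, which is why outliers and label noise influence the nonconvex formulation less.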


7.9 Computational Complexity 


A support vector machine with a linear kernel can be trained in linear time using the
primal formulation (Joachims, 2006), but in general, limits are difficult to establish. The
quadratic optimization problem is usually solved by sequential minimal optimization
of the dual formulation, which chooses a subset of the training examples to work 
with, and subsequent iterations change the subset. This way, the learning procedure 
avoids costly numerical methods on the full quadratic optimization problem. While 
there are no strict limits on the speed of convergence for training, sequential minimal 
optimization scales between linear and quadratic time in the number of training 
instances on a range of test problems (Platt, 1999). 

The calculations are dominated by the kernel evaluation. Given a linear or
polynomial kernel, the calculation of an entry in the kernel matrix takes O(d) time;
thus, calculating the whole kernel matrix has $O(N^2 d)$ time complexity. Solving
the quadratic dual problem or the least-squares formulation has $O(N^3)$ complexity.
Combining the two steps, the classical support vector machine algorithm has at least
$O(N^2(N + d))$ complexity. This complexity can be mitigated by using spatial support
structures for the data (Yu et al., 2003).

The quantum formulation yields an exponential speedup in these two steps, leading 
to an overall complexity of O(log(Nd)) (Section 12.3). 


8 Regression Analysis


In regression, we seek to approximate a function given a finite sample of training
instances $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, resembling supervised classification.
Unlike in classification, however, the range of $y_i$ is not discrete: it can take any
value in $\mathbb{R}$. The approximating function f is also called the regression function.

A natural way of evaluating the performance of an approximating function f is the
residual sum of squares:


$E = \sum_{i=1}^{N} (y_i - f(x_i))^2$. (8.1)

Minimizing this value over arbitrary families of functions leads to infinitely many 
solutions. Moreover, the residual sum of squares being zero does not imply that the 
generalization performance will be good. We must restrict the eligible functions to a 
smaller set. 

Most often, we seek a parametric function estimate—that is, we seek the optimum 
in a family of functions characterized by parameters. The optimization is performed 
on the parameter space. We further refine function classes to linear estimates 
(Section 8.1) and nonlinear estimates (Section 8.2). The optimization process may 
also be regularized to ensure better generalization performance. The corresponding 
quantum method is based on process tomography, which optimizes fit over certain 
families of unitary transformations (Chapter 13). 

Nonparametric regression follows a different paradigm. It requires larger sample 
sizes, but the predictor does not take a predefined form (Section 8.3). 


8.1 Linear Least Squares 


Linear least squares is a parametric regression method. If the feature space is
$\mathbb{R}^d$, we assume that the approximating function has the form


$f(x, \beta) = \beta^\top x$, (8.2)


where $\beta$ is a parameter vector. This problem is also known as general linear
regression. We seek to minimize the squared residual

$E(\beta) = \sum_{i=1}^{N} (y_i - f(x_i, \beta))^2$. (8.3)

If we arrange the data instances in a matrix, the same formula becomes

$E(\beta) = \|y - X\beta\|^2 = (y - X\beta)^\top (y - X\beta) = y^\top y - \beta^\top X^\top y - y^\top X\beta + \beta^\top X^\top X\beta$, (8.4)

where element i of the column vector y is $y_i$.


Differentiating this with respect to $\beta$, we obtain the normal equations, which are
written as


$X^\top X\beta = X^\top y$. (8.5)


The solution is obtained by matrix inversion:

$\beta = (X^\top X)^{-1} X^\top y$. (8.6)


An underlying assumption is that the matrix X has full rank—that is, the training 
instances are linearly independent. 
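
For a single input variable, the normal equations can be solved by hand. The sketch below assumes that each data instance is augmented with a constant feature 1, so that the parameter vector holds an intercept and a slope; this augmentation and the function name `fit_line` are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Solve the 2 x 2 normal equations X^T X beta = X^T y for rows (1, x_i)."""
    n = len(xs)
    sx, sxx = sum(xs), sum(x * x for x in xs)
    sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx  # determinant of X^T X; zero if X lacks full rank
    if det == 0:
        raise ValueError("X does not have full column rank")
    beta0 = (sxx * sy - sx * sxy) / det
    beta1 = (n * sxy - sx * sy) / det
    return beta0, beta1


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated by y = 1 + 2x
beta0, beta1 = fit_line(xs, ys)
```

The determinant check corresponds to the full-rank assumption: if all the x values coincide, the inverse in Equation 8.6 does not exist.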

The method also applies to quadratic, cubic, quartic, and higher polynomials. For
instance, if $x \in \mathbb{R}$, then the approximating function becomes


$f(x, \beta) = \beta_0 + \beta_1 x + \beta_2 x^2$. (8.7)

For higher-order polynomials, orthogonal polynomials provide better results.
We may also regularize the problem, either to increase sparsity or to ensure the
smoothness of the solution. The minimization problem becomes


$E(\beta) = \|y - X\beta\|^2 + \|\Gamma\beta\|^2$, (8.8)


where $\Gamma$ is an appropriately chosen matrix, the Tikhonov matrix.
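
As a worked step, assuming the regularized objective has the usual Tikhonov form $\|y - X\beta\|^2 + \|\Gamma\beta\|^2$, the minimizer again has a closed form, obtained in the same way as the normal equations:

```latex
\nabla_\beta E(\beta) = -2X^\top (y - X\beta) + 2\Gamma^\top \Gamma \beta = 0
\quad\Longrightarrow\quad
\hat{\beta} = \left( X^\top X + \Gamma^\top \Gamma \right)^{-1} X^\top y .
```

With the particular choice $\Gamma = \sqrt{\lambda}\, I$, where $\lambda > 0$ is an assumed regularization constant, this reduces to ridge regression, and the matrix to be inverted is nonsingular even when X does not have full rank.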


8.2 Nonlinear Regression 


In a nonlinear approximation, the combination of the model parameters and the 
dependency on independent variables is not linear. Unlike in linear regression, there is 
no generic closed-form expression for finding an optimal fit of parameters for a given 
family of functions. 

Support vector machines extend to nonlinear regression problems—this method is 
called support vector regression (Drucker et al., 1997; Vapnik et al., 1997). Instead of 
a binary value of y;, the labels take an arbitrary real value. 

The approximating function is a linear combination of nonlinear basis functions,
the kernel functions. This linear combination is parameterized by the number of
support vectors: as in the case of classification problems, the model produced
by support vector regression depends only on a subset of the training data; that is, the
parameters are independent of the dimensions of the space.

The support vector formulation implies the cost function of the optimization
ignores training data instances close to the model prediction: the optimization is
regularized. The formulation most often uses a hinge loss function adapted for
regression (the ε-insensitive loss).


8.3 Nonparametric Regression 


Parametric regression is indirect: it estimates the parameters of the approximating 
function. Nonparametric regression, on the other hand, estimates the function directly. 
Basic assumptions apply to the function: it should be smooth and continuous (Härdle,
1990). Otherwise, nonparametric modeling accommodates a flexible form of the 
regression curve. 

Smoothing is a form of nonparametric regression that estimates the influence of
the data points in a neighborhood so that their values predict the value for nearby
locations. The local averaging procedure is defined as


$f(x) = \frac{1}{N} \sum_{i=1}^{N} w_i(x) y_i$, (8.9)


where $w_i(x)$ is the weight function describing the influence of data point $y_i$ at x.

For instance, the extension of the K-nearest neighbors algorithm to regression 
problems will assign a zero weight for instances not in the K-nearest neighbors of a 
target x, and all the others will have an equal weight N/K. The parameter K regulates 
the smoothness of the curve. 
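
The K-nearest neighbors weighting can be sketched in a few lines of Python. This is a minimal illustration, assuming one-dimensional inputs and a local average of the form f(x) = (1/N) Σ w_i(x) y_i; the function names `knn_weights` and `knn_regress` are hypothetical and do not come from the text.

```python
def knn_weights(x, xs, k):
    """Weights w_i(x): N/K for the K nearest training points, zero otherwise."""
    n = len(xs)
    # Indices of the k training points closest to x (ties broken by index).
    nearest = sorted(range(n), key=lambda i: (abs(xs[i] - x), i))[:k]
    weights = [0.0] * n
    for i in nearest:
        weights[i] = n / k
    return weights


def knn_regress(x, xs, ys, k):
    """Local average f(x) = (1/N) * sum_i w_i(x) * y_i."""
    weights = knn_weights(x, xs, k)
    return sum(w * y for w, y in zip(weights, ys)) / len(xs)


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 4.0, 9.0, 16.0]
# With k = 2 the prediction at x = 1.2 is the plain mean of the labels at x = 1 and x = 2.
print(knn_regress(1.2, xs, ys, k=2))
```

With the weight N/K, the factor 1/N cancels and the estimate reduces to the mean label of the K neighbors, which is why a larger K gives a smoother curve.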

Kernel smoothing regulates smoothness through the support of a kernel function.
One frequent approximation is the Nadaraya-Watson estimator:

$$f(x) = \frac{\sum_{i=1}^{N} K_\lambda(x - x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x - x_i)}, \tag{8.10}$$

where $\lambda$ is the support or bandwidth of the kernel.

Spline smoothing aims to produce a curve without much rapid local variation. It
optimizes a penalized version of the squared residuals:

$$E = \sum_{i=1}^{N} \left(y_i - f(x_i)\right)^2 + \lambda \int_{x_1}^{x_N} \left(f''(x)\right)^2 dx. \tag{8.11}$$
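
Returning to kernel smoothing, the Nadaraya-Watson estimator is a kernel-weighted average of the labels. The sketch below assumes one-dimensional inputs and a Gaussian kernel, which is only one possible choice of kernel and is an assumption of this example; `gaussian_kernel` and `nadaraya_watson` are hypothetical names.

```python
import math


def gaussian_kernel(u, bandwidth):
    """Unnormalized Gaussian kernel; the normalization cancels in the ratio."""
    return math.exp(-0.5 * (u / bandwidth) ** 2)


def nadaraya_watson(x, xs, ys, bandwidth):
    """Kernel-weighted average of the labels ys around the query point x."""
    weights = [gaussian_kernel(x - xi, bandwidth) for xi in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 4.0, 9.0]
for bandwidth in (0.1, 1.0, 10.0):
    # Small bandwidths follow the data closely; large ones approach the global mean.
    print(bandwidth, nadaraya_watson(1.0, xs, ys, bandwidth))
```

The bandwidth plays the role that K plays in the nearest-neighbor variant: it controls how many points effectively contribute to each local average.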


8.4 Computational Complexity 


In linear least-squares regression, solving the normal equation (Equation 8.5) is
typically done via singular value decomposition. The overall cost of this is $O(dN^2)$
steps. Matrix inversion has an efficient quantum variant, especially if the input and
output are also quantum states (Section 10.3). Bypassing inversion, quantum process
tomography directly seeks a unitary transformation over certain groups, which is also
similar to parametric regression (Chapter 13).

In nonparametric regression, we usually face polynomial complexity. Using the K- 
nearest neighbors algorithm, we find the complexity is about cubic in the number of 
data points. The most expensive step is finding the nearest neighbors, which is done 
efficiently by quantum methods (Section 12.1). 


Boosting 


Boosting is an ensemble learning method that builds multiple learning models to 
create a composite learner. Boosting differs from bagging mentioned in Section 2.6: it 
explicitly seeks models that complement one another, whereas bagging is agnostic to 
how well individual learners deal with the data compared with one another. 

Boosting algorithms iteratively and sequentially train weak classifiers with respect 
to a distribution, and add them to a final strong classifier (Section 9.1). As they are 
added, they are weighted to reflect the accuracy of each learner. 

One learner learns what the previous one could not: the relative weights of the 
data instances are adjusted in each iteration. Examples that are misclassified by the 
previous learner gain weight, and examples that are classified correctly lose weight. 
Thus, subsequent weak learners focus more on data instances that the previous weak 
learners misclassified. 

Sequentiality also means that the learners cannot be trained in parallel, unlike 
in bagging and other ensemble methods. This puts boosting at a computational 
disadvantage. 

The variation between boosting algorithms is how they weight training data 
instances and weak learners. Adaptive boosting (AdaBoost) does not need prior 
knowledge of the error rate of the weak learners; rather, it adapts to the accuracies 
observed (Section 9.2). 

Generalizing AdaBoost, a family of convex objective functions defines a range of
similar boosting algorithms that overcome some of the shortcomings of the original
method (Section 9.3). Yet, convex loss functions will always struggle with outliers,
and this provides strong motivation to develop nonconvex objective functions. These,
however, always require computational tricks to overcome the difficulty of finding the
optimum (Section 9.4).


9.1 Weak Classifiers 


If the base learners are simple, they are referred to as decision stumps. A decision
stump classifies an instance on the basis of the value of just a single input feature (Iba
and Langley, 1992). Decision stumps have low variance but high bias. Yet, such simple
learners perform surprisingly well on common data sets, for instance, compared with
full decision trees (Holte, 1993).

A decision tree classifier is a tree in which internal nodes are labeled by features.
Branches departing from them are labeled by tests on the weight that the feature has in
the test object. Leaves are labeled by target categories (Mitchell, 1997). The classifier
categorizes an object $x_i$ by recursively testing for the weights that the features labeling
the internal nodes have in vector $x_i$, until a leaf node is reached. The label of this node
is then assigned to $x_i$. Experience shows that for trees, a depth between four and eight
works best for boosting (Hastie et al., 2008, p. 363).

A method for learning a decision tree for category $c_k$ consists of the following
divide-and-conquer strategy (Lan et al., 2009):


• Check whether all the training examples have the same label;
• If they do not, choose a feature $j$, partition the training objects into classes of objects that
have the same value for $j$, and place each such class in a separate subtree.


The above process is recursively repeated on the subtrees until each leaf of the
tree generated contains training examples assigned to the same category $c_k$. The
selection of the feature j on which to operate the partition is generally made according 
to an information gain (Cohen and Singer, 1996; Lewis and Ringuette, 1994) or 
entropy (Sebastiani, 2002) criterion. A fully grown tree is prone to overfitting, as 
some branches may be too specific for the training data. Therefore, decision tree 
methods normally include a method for growing the tree and also one for pruning 
it, thus removing the overly specific branches (Mitchell, 1997). 

The complexity of individual learners is less important: the diversity of learners 
usually leads to better overall performance of the composite classifier (Kuncheva and 
Whitaker, 2003). The diversity measure can be as simple as the correlation between 
the outputs of the weak learners; there does not appear to be a significant difference 
in applying various diversity measures. Decision stumps and small decision trees are 
examples of simple weak learners that can add to the variety. 
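
A decision stump is simple enough to train by exhaustive search. The following sketch assumes labels in {-1, +1}, numeric features, and a weight for each training instance; the names `stump_predict` and `train_stump` and the threshold-plus-polarity parameterization are choices made for this illustration, not definitions from the text.

```python
def stump_predict(x, feature, threshold, polarity):
    """Predict polarity if the chosen feature reaches the threshold, else -polarity."""
    return polarity if x[feature] >= threshold else -polarity


def train_stump(xs, ys, weights):
    """Return (feature, threshold, polarity, weighted_error) of the best stump.

    xs is a list of feature vectors, ys holds labels in {-1, +1}, and
    weights is a distribution over the training instances.
    """
    best = None
    for feature in range(len(xs[0])):
        # Candidate thresholds are the observed values of this feature.
        for threshold in sorted({x[feature] for x in xs}):
            for polarity in (+1, -1):
                error = sum(w for x, y, w in zip(xs, ys, weights)
                            if stump_predict(x, feature, threshold, polarity) != y)
                if best is None or error < best[3]:
                    best = (feature, threshold, polarity, error)
    return best


xs = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
ys = [-1, -1, 1, 1]
print(train_stump(xs, ys, [0.25] * 4))
```

Because the stump looks at one feature only, its bias is high, but the weighted error it minimizes is exactly the quantity a boosting algorithm needs from a weak learner.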

Formally, in boosting, we iteratively train $K$ weak classifiers, $\{h_1, h_2, \ldots, h_K\}$.
The combined classifier weights the vote of each weak classifier as a function of its
accuracy. In an iteration, we either change the weight of one weak learner, a process
known as corrective boosting, or change the weight of all of them, leading to totally
corrective boosting.


9.2 AdaBoost 


AdaBoost adapts to the strengths and weaknesses of its weak classifiers by emphasizing
training instances in subsequent classifiers that were misclassified by previous
classifiers. For a generic outline, see Algorithm 3.

AdaBoost initially assigns a weight $w_i$ to each training instance. The value of the
weight is $1/N$, where $N$ is the number of training instances.

In each iteration, a subset is sampled from the training set. Subsequent training sets
may overlap, and the sampling is done with replacement. The selection probability of
an instance equals its current weight. Then we train a classifier on the sample, and the
error is measured on the same set. The error rate of a weak classifier at
iteration $t$ is given by

$$E_t = \sum_{i=1}^{N} w_i\, I(y_i \neq h_t(x_i)), \tag{9.1}$$

where $I(\cdot)$ is the indicator function.

Weights of the data instances are adjusted after training the weak classifier. If an
instance is incorrectly classified, its weight will increase. If it is correctly classified, its
weight will decrease. Hence, a data instance with a high weight is one that the weak
learners struggle to classify correctly. Formally, the weight of a correctly classified
instance is multiplied by $E_t/(1 - E_t)$. Once all weights have been updated, they are
normalized to give a probability distribution.

Since the probability is higher that difficult cases make it to a sample, more
classifiers are exposed to them, and the chances are better that one of the learners
will classify them correctly. Hence, classifiers complement one another.


ALGORITHM 3 AdaBoost
Require: Training and validation data, number of training iterations $T$
Ensure: Strong classifier
Initialize the weight distribution $d$ over training samples as the uniform
    distribution $\forall i: d(i) = 1/N$.
for $t = 1$ to $T$ do
    From the family of weak classifiers $H$, find the classifier
    $h_t$ that maximizes the absolute value of the difference of
    the corresponding weighted error rate $E_t$ and 0.5 with respect
    to the distribution $d$: $h_t = \arg\max_{h_t \in H} |0.5 - E_t|$, where
    $E_t = \sum_{i=1}^{N} d(i)\, I(y_i \neq h_t(x_i))$.
    if $|0.5 - E_t| \leq \beta$, where $\beta$ is a previously chosen threshold, then
        Stop.
    end if
    $\alpha_t \leftarrow \frac{1}{2} \ln \frac{1 - E_t}{E_t}$
    for $i = 1$ to $N$ do
        $d(i) \leftarrow d(i) \exp\{\alpha_t [2 I(y_i \neq h_t(x_i)) - 1]\}$
    end for
    Normalize $d(i) \leftarrow d(i) / \sum_j d(j)$ for all $i$
end for


In the composite classifier, a weight is also assigned to each weak learner. The
lower the error rate of a classifier, the higher its weight should be. The weight is
usually chosen as

$$\alpha_t = \frac{1}{2} \ln \frac{1 - E_t}{E_t}. \tag{9.2}$$

The error of the combined strong classifier on the training data approaches zero
exponentially fast in the number of iterations if the weak classifiers do at least slightly
better than random guessing. The error rate of the strong classifier will improve if any
of the weak classifiers improve. This is in stark contrast with earlier boosting, bagging,
and other ensemble methods.
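
The drop in training error can be observed on a toy problem. The sketch below is one possible reading of AdaBoost, not the book's reference implementation: it reweights instances directly instead of resampling, uses decision stumps as weak learners, and weights each learner by α = ½ ln((1 − E)/E). All function names are hypothetical.

```python
import math


def stump_predict(x, feature, threshold, polarity):
    return polarity if x[feature] >= threshold else -polarity


def best_stump(xs, ys, d):
    """Stump with the lowest weighted error under the distribution d."""
    best = None
    for feature in range(len(xs[0])):
        for threshold in sorted({x[feature] for x in xs}):
            for polarity in (+1, -1):
                error = sum(w for x, y, w in zip(xs, ys, d)
                            if stump_predict(x, feature, threshold, polarity) != y)
                if best is None or error < best[0]:
                    best = (error, feature, threshold, polarity)
    return best


def adaboost(xs, ys, rounds, beta=1e-10):
    """Return a list of (alpha, feature, threshold, polarity) tuples."""
    n = len(xs)
    d = [1.0 / n] * n                      # uniform initial distribution
    ensemble = []
    for _ in range(rounds):
        error, feature, threshold, polarity = best_stump(xs, ys, d)
        if abs(0.5 - error) <= beta:       # no better than random guessing
            break
        error = min(max(error, 1e-12), 1.0 - 1e-12)   # guard against log(0)
        alpha = 0.5 * math.log((1.0 - error) / error)
        ensemble.append((alpha, feature, threshold, polarity))
        # Misclassified instances gain weight, correctly classified ones lose it.
        d = [w * math.exp(alpha if stump_predict(x, feature, threshold, polarity) != y
                          else -alpha)
             for x, y, w in zip(xs, ys, d)]
        total = sum(d)
        d = [w / total for w in d]
    return ensemble


def predict(ensemble, x):
    score = sum(alpha * stump_predict(x, f, thr, pol) for alpha, f, thr, pol in ensemble)
    return 1 if score >= 0 else -1


xs = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
ys = [1, 1, -1, -1, 1, 1]                  # not separable by a single stump
for rounds in (1, 2, 3):
    model = adaboost(xs, ys, rounds)
    print(rounds, sum(predict(model, x) != y for x, y in zip(xs, ys)))
```

Since both polarities of every stump are searched, picking the lowest weighted error is equivalent here to maximizing |0.5 − E|. On this data set no single stump is error-free, yet the weighted vote of three stumps classifies every training instance correctly.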

If the Vapnik-Chervonenkis (VC) dimension of the weak classifiers is $d \geq 2$, then,
after $T$ iterations, the VC dimension of the strong classifier will be at most


$$2(d + 1)(T + 1) \log_2(e(T + 1)). \tag{9.3}$$


Given the VC dimension, we may derive a formal upper limit on the generalization 
error with Vapnik’s theorem, and hence establish the optimal number of iterations. 


9.3 A Family of Convex Boosters 


AdaBoost is a minimization of a convex loss function over a convex set of functions.
The loss being minimized is exponential:


$$L(y_i, f(x_i)) = e^{-y_i f(x_i)}, \tag{9.4}$$


and we are seeking a function


$$f(x) = \sum_t w_t h_t(x). \tag{9.5}$$


Finding the optimum is equivalent to a coordinate-wise gradient descent through 
a greedy iterative algorithm: it chooses the direction of steepest descent at a given 
step (Friedman et al., 2000). 

Following this thought, we can, in fact, define a family of boosters by changing the 
loss function in Equation 9.4 (Duffy and Helmbold, 2000). 

A convex loss will incur some penalty to points of small positive margin—that 
is, points that are correctly classified and close to the boundary. This is critical 
to obtaining a classifier of maximal margin, which ensures good generalization 
(Section 2.4). A good convex loss function will assign zero penalty to points of large 
positive margin, which are points correctly classified and that are also far from the 
boundary. A convex loss function applies a large penalty to points with large negative 
margin—that is, points which are incorrectly classified and are far from the boundary 
(Figure 9.1). 

Thus, it is easy to see that any convex potential, that is, a nonincreasing function in $C^1$
with a limit zero at infinity, is bound to suffer from classification noise (Long and
Servedio, 2010). A label that is incorrect in the training data will incur a large negative
margin, and it will pull the decision surface away from the optimum, leading to a
distorted classifier.

Unbounded growth of negative margins is the key problem, explaining why the
performance of AdaBoost is poor in the presence of label noise: its exponential loss
is oversensitive to mislabeling (Dietterich, 2000).

Figure 9.1 Convex loss functions compared with the optimal 0-1 loss.


The convex nature of the cost function enables the optimization to be viewed as a
gradient descent to a global optimum (Mason et al., 1999). For instance, the sigmoid
loss will lead to a global optimum:


$$L(y_i, f(x_i)) = \frac{1}{2} \left[1 - \tanh(\lambda y_i f(x_i))\right]. \tag{9.6}$$


Higher values of $\lambda$ yield a better approximation of the 0-1 loss. Similarly,
LogitBoost replaces the loss function with one that grows linearly with the negative
margin, improving sensitivity (Friedman et al., 2000):


$$L(y_i, f(x_i)) = \log\left(1 + e^{-y_i f(x_i)}\right). \tag{9.7}$$
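
How differently these losses treat a badly misclassified point becomes clear when they are evaluated as functions of the margin m = y_i f(x_i). The sketch below assumes the forms e^{-m}, ½[1 − tanh(λm)], and log(1 + e^{-m}) for the exponential, sigmoid, and logistic losses; the function names are illustrative only.

```python
import math


def exponential_loss(margin):
    return math.exp(-margin)


def sigmoid_loss(margin, lam=1.0):
    return 0.5 * (1.0 - math.tanh(lam * margin))


def logistic_loss(margin):
    return math.log(1.0 + math.exp(-margin))


def zero_one_loss(margin):
    # Convention of this sketch: a margin of exactly zero counts as correct.
    return 1.0 if margin < 0 else 0.0


for margin in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(margin,
          round(exponential_loss(margin), 4),
          round(logistic_loss(margin), 4),
          round(sigmoid_loss(margin), 4),
          zero_one_loss(margin))
```

At a margin of -4 the exponential loss already exceeds 50, while the logistic loss stays close to 4 and the sigmoid loss remains below 1, which illustrates why the exponential loss reacts so strongly to mislabeled instances.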


While changing the exponential loss to an alternative convex function will improve
generalization performance, a gradient descent will only give a linear combination
of weak learners. Sparsity is not considered. To overcome this problem, we may
regularize boosting, for instance by limiting the number of steps in the iterations (Zhang
and Yu, 2005) or by using a soft margin as in support vector machines (Rätsch et al.,
2001).

A simple form of regularization is an early stop—that is, limiting the number of 
iterations the boosting algorithm can take. It is difficult to find an analytical limit 
when the training should stop. Yet, if the iterations continue to infinity, AdaBoost will 
overfit the data, including mislabeled examples and outliers. 

Another simple form of regularization is by shrinkage, in which the contribution
of each weak learner is scaled by a learning rate parameter $0 < \nu < 1$. Shrinkage
provides superior results to restricting the number of iterations (Copas, 1983; Friedman,
2001). Smaller values of $\nu$ result in a larger training risk in the same number of
iterations. At the price of a larger number of iterations, a small learning rate leads
to a lower generalization error.

To deal with noisy labels, we can introduce soft margins as in the case of
support vector machines (Rätsch et al., 2001). Without such regularization, boosting
is sensitive to even a low number of incorrectly labeled training examples (Dietterich,
2000). If we use Lz regularization, we maintain the convexity of the objective function: 


L(w) + Allwll’, (9.8) 


where $L(w) = \sum_i L(y_i, f(x_i))$, and dependence on $w$ is implicit through $f$. This is
not unlike the primary formulation of support vector machines with Ly regularization 
of soft margins in the primal form. The difference is that support vector machines 
consider quadratic interactions among the variables. This picture changes in QBoost, 
the boosting algorithm suitable for quantum hardware, where quadratic interactions 
are of crucial importance (Section 14.2). 

In LPBoost (Demiriz et al., 2002), the labels produced by the weak hypotheses 
create a new feature space. This way, we are no longer constrained by sequential 
training. All weights are modified in an iteration; this method is also known as totally 
corrective boosting. The objective function is regularized by a soft margin criterion. 
The model is more sensitive to the quality of weak learners, but the final strong 
classifier will be sparser. Overall performance and computational cost are comparable 
to those of AdaBoost. 

The ideal nonconvex loss function, the 0-1 loss (Section 7.8), remains robust even 
if the labels are flipped in up to 50% of the training instances, provided that the 
classes are separable (Manwani and Sastry, 2013). None of the convex loss functions 
described above handle mislabeled instances well. Hence, irrespective of the choice 
of convex loss, the limits established by Long and Servedio (2010) apply, which is a 
strong incentive to look at nonconvex loss functions. 


9.4 Nonconvex Loss Functions 


SavageBoost bypasses the problem of optimizing the objective function containing 
the nonconvex loss (Masnadi-Shirazi and Vasconcelos, 2008). Instead, it finds the 
minimum conditional risk, which remains convex—the convexity of the loss function 
is irrelevant to this optimization. The conditional risk is defined as 


$$C_L(\eta, f) = \eta L(f) + (1 - \eta) L(-f). \tag{9.9}$$


The minimum conditional risk

$$C_L^*(\eta) = \inf_f C_L(\eta, f) = C_L(\eta, f_L^*) \tag{9.10}$$

must satisfy two properties. First, it must be a concave function of $\eta \in [0, 1]$. Then,
if $f_L^*$ is differentiable, $C_L^*(\eta)$ is also differentiable and, for any pair $(v, \hat{\eta})$ such that
$v = f_L^*(\hat{\eta})$,

$$C_L(\eta, v) - C_L^*(\eta) = B_{-C_L^*}(\eta, \hat{\eta}), \tag{9.11}$$

where

$$B_F(\eta, \hat{\eta}) = F(\eta) - F(\hat{\eta}) - (\eta - \hat{\eta}) F'(\hat{\eta}) \tag{9.12}$$

is the Bregman divergence of the convex function $F$. This second condition provides
insight into learning algorithms as methods for the estimation of the class posterior
probability $\eta(x)$. Finding the optimal function $f(x)$ that minimizes the conditional risk
in Equation 9.9 is equivalent to searching for the probability estimate $\hat{\eta}(x)$ that
minimizes the Bregman divergence of $-C_L^*$ in Equation 9.11.

Minimizing the Bregman divergence imposes no restrictions on the convexity of
the loss function. Masnadi-Shirazi and Vasconcelos (2008) derived a nonconvex loss
function, the Savage loss:


$$L(y_i, f(x_i)) = \frac{1}{\left(1 + e^{2 y_i f(x_i)}\right)^2}. \tag{9.13}$$


The Savage loss function quickly becomes constant as the margin $m = y_i f(x_i) \to -\infty$,
making it more robust to outliers and mislabeled instances. The corresponding
SavageBoost algorithm is based on a gradient descent, and it is not totally corrective.
SavageBoost converges faster than convex algorithms on select data sets.


TangentBoost for computer vision uses a nonconvex loss function (Masnadi-Shirazi
et al., 2010):


$$L(y_i, f(x_i)) = \left[2 \arctan(y_i f(x_i)) - 1\right]^2. \tag{9.14}$$


It was designed to retain the desirable properties of a convex potential loss: it is
margin-enforcing with small penalty for correctly classified instances close to the
decision boundary. Additionally, it has a bounded penalty for large negative margins.
In this sense, it resembles Savage loss. Yet, it also penalizes points with large positive
margin. This property might be useful in improving the margin, as points further
from the boundary still influence the decision surface. In selected computer vision
tasks, tangent loss performed marginally better than Savage loss. Since tangent loss is
nonconvex, an approximate gradient descent estimates the optimum using the Gauss
algorithm. Some nonconvex loss functions are shown in Figure 9.2.
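
The bounded penalties of the two nonconvex losses can be checked numerically. The sketch below assumes the Savage loss 1/(1 + e^{2m})² and the tangent loss [2 arctan(m) − 1]² as functions of the margin m = y_i f(x_i); the function names are hypothetical.

```python
import math


def savage_loss(margin):
    return 1.0 / (1.0 + math.exp(2.0 * margin)) ** 2


def tangent_loss(margin):
    return (2.0 * math.atan(margin) - 1.0) ** 2


for margin in (-50.0, -5.0, 0.0, math.tan(0.5), 5.0, 50.0):
    print(round(margin, 3), savage_loss(margin), tangent_loss(margin))
```

Both losses level off for large negative margins: the Savage loss approaches 1 and the tangent loss approaches (π + 1)². The tangent loss reaches zero at m = tan(0.5) and grows again for larger positive margins, which is the penalty on points far on the correct side that the text mentions.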
Figure 9.2 Nonconvex loss functions compared with the optimal 0-1 loss.


Clustering Structure and
Quantum Computing


Quantum random access memory (QRAM) allows the storing of quantum states, 
and it can be queried with an address in superposition (Section 10.1). Calculating 
dot products and a kernel matrix relies on this structure (Section 10.2). These two 
components are not only relevant in unsupervised learning, they are also central to 
quantum support vector machines with exponential speedup (Section 12.3). 

Quantum principal component analysis is the first algorithm we discuss that uses
QRAM: it retrieves quantum states that perform a self-analysis, an eigendecomposition
of their own structure (Section 10.3). Quantum manifold learning takes this a
step further by initializing a data structure with geodesic distances (Section 10.4).

The quantum K-means algorithm may rely on classical or quantum input states; 
in the latter case, it relies on a QRAM and the quantum method for calculating dot 
products (Section 10.5). Quantum K-medians emerges from different ideas and it 
relies on Grover’s search to deliver a speedup (Section 10.6). Quantum hierarchical 
clustering resembles this second approach (Section 10.7). 

We summarize overall computational complexity for the various approaches in the 
last section of this chapter (Section 10.8). 


10.1 Quantum Random Access Memory 


A random access memory allows memory cells to be addressed in a classical 
computer: it is an array in which each cell of the array has a unique numerical address. 
A QRAM serves a similar purpose (Giovannetti et al., 2008). 

A random access memory has an input register to address the cell in the array, and
an output register to return the stored information. In a QRAM, the address and output
registers are composed of qubits. The address register contains a superposition of
addresses $\sum_j p_j |j\rangle_a$, and the output register will contain a superposition of information,
correlated with the address register: $\sum_j p_j |j\rangle_a |D_j\rangle_d$.

Using a “bucket-brigade” architecture, a QRAM reduces the complexity of retrieving
an item to $O(\log 2^n)$ switches, where $n$ is the number of qubits in the address
register.

The core idea of the architecture is to have qutrits instead of qubits allocated in each
node of a bifurcation graph (Figure 10.1). A qutrit is a three-level quantum system.
Let us label the three levels $|\mathrm{wait}\rangle$, $|\mathrm{left}\rangle$, and $|\mathrm{right}\rangle$. At the start of each memory call, each
qutrit is in the $|\mathrm{wait}\rangle$ state.

Figure 10.1 A bifurcation graph for QRAM: the nodes are qutrits.


The qubits of the address register are sent through the graph one by one. The $|\mathrm{wait}\rangle$
state is transformed into $|\mathrm{left}\rangle$ or $|\mathrm{right}\rangle$, depending on the current qubit. If the state
is not $|\mathrm{wait}\rangle$, the node routes the current qubit. The result is a superposition of routes. Once
the routes have thus been carved out, a bus qubit is sent through to interact with the
memory cells at the end of the routes. Then, it is sent back to write the result to the
output register. Finally, a reverse evolution on the states is performed to reset all of
them to $|\mathrm{wait}\rangle$.

The advantage of the bucket-brigade approach is the low number of qutrits involved
in the retrieval: in each route of the final superposition, only $\log N$ qutrits are not in
the $|\mathrm{wait}\rangle$ state. The average fidelity of the final state if all qutrits are involved in the
superposition is $O(1 - \epsilon \log N)$ (Giovannetti et al., 2008).
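
The routing rule can be traced classically for a single address, which corresponds to one branch of the superposition. The sketch below is only a classical illustration of how the route is carved out, not a quantum simulation; the heap-style node numbering and the name `bucket_brigade_read` are assumptions of this example.

```python
def bucket_brigade_read(address_bits, memory):
    """Trace one route through the bifurcation graph.

    address_bits lists the address from the most significant bit down;
    memory has 2**n cells. Returns the stored value and the number of
    nodes that left the "wait" state.
    """
    n = len(address_bits)
    if len(memory) != 2 ** n:
        raise ValueError("memory must hold 2**n cells")
    state = {}                      # node -> "left"/"right"; absent means "wait"
    for bit in address_bits:
        node = 1                    # root of the tree, children are 2k and 2k + 1
        # Active nodes route the incoming bit further down the tree.
        while node in state:
            node = 2 * node + (1 if state[node] == "right" else 0)
        # The first waiting node absorbs the bit and becomes a switch.
        state[node] = "right" if bit else "left"
    # The bus follows the carved route to a memory cell.
    node = 1
    while node in state:
        node = 2 * node + (1 if state[node] == "right" else 0)
    return memory[node - 2 ** n], len(state)


print(bucket_brigade_read([1, 0, 1], list("abcdefgh")))
```

For eight memory cells only three nodes leave the waiting state, one per address bit, while a conventional fan-out scheme would involve every node of the tree.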


10.2 Calculating Dot Products 


In the quantum algorithm to calculate dot products, the training instances are presented
as quantum states $|x_i\rangle$. We do not require the training instances to be normalized, but
the normalization must be given separately. To reconstruct a state from the QRAM,
we need to query the memory $O(\log N)$ times.

To evaluate the dot product of two training instances, we need to do the following
(Lloyd et al., 2013a):


• Generate two states, $|\psi\rangle$ and $|\phi\rangle$, with an ancilla variable;
• Estimate the parameter $Z = \|x_i\|^2 + \|x_j\|^2$, the sum of the squared norms of the two
instances;
• Perform a projective measurement on the ancilla alone, comparing the two states.


$Z$ times the probability of the success of the measurement yields the square of the
Euclidean distance between the two training instances: $\|x_i - x_j\|^2$. We calculate the
dot product in the linear kernel as $x_i^\top x_j = \frac{Z - \|x_i - x_j\|^2}{2}$.

The state $|\psi\rangle = \frac{1}{\sqrt{2}}(|0\rangle|x_i\rangle + |1\rangle|x_j\rangle)$ is easy to construct by querying the QRAM.
We estimate the other state

$$|\phi\rangle = \frac{1}{\sqrt{Z}} \left(\|x_i\| |0\rangle - \|x_j\| |1\rangle\right), \tag{10.1}$$

and the parameter $Z$ together. We evolve the state

$$\frac{1}{\sqrt{2}} (|0\rangle - |1\rangle) \otimes |0\rangle \tag{10.2}$$

with the Hamiltonian

$$H = \left(\|x_i\| |0\rangle\langle 0| + \|x_j\| |1\rangle\langle 1|\right) \otimes \sigma_x. \tag{10.3}$$

The resulting state is

$$\frac{1}{\sqrt{2}} \left[\cos(\|x_i\| t)|0\rangle - \cos(\|x_j\| t)|1\rangle\right] \otimes |0\rangle
- \frac{i}{\sqrt{2}} \left[\sin(\|x_i\| t)|0\rangle - \sin(\|x_j\| t)|1\rangle\right] \otimes |1\rangle.$$

By appropriate choice of $t$ ($\|x_i\| t, \|x_j\| t \ll 1$), and by measuring the ancilla bit, we
get the state $|\phi\rangle$ with probability $\frac{1}{2} Z t^2$, which in turn allows the estimation of $Z$. If the
desired accuracy of the estimation is $\epsilon$, then the complexity of constructing $|\phi\rangle$ and $Z$
is $O(\epsilon^{-1})$.
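
The classical identities behind this procedure are easy to verify. The sketch below assumes real-valued vectors; it recovers the dot product from Z and the squared Euclidean distance, and it evaluates the probability of finding the second qubit in state 1 after the evolution, ½[sin²(‖x_i‖t) + sin²(‖x_j‖t)], which for small t is approximately ½Zt². The function names are hypothetical.

```python
import math


def dot_from_distance(xi, xj):
    """Dot product recovered as (Z - ||xi - xj||^2) / 2."""
    z = sum(a * a for a in xi) + sum(b * b for b in xj)
    dist2 = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return (z - dist2) / 2.0


def ancilla_probability(norm_i, norm_j, t):
    """Probability of measuring 1 on the second qubit after evolving for time t."""
    return 0.5 * (math.sin(norm_i * t) ** 2 + math.sin(norm_j * t) ** 2)


xi, xj = [1.0, 2.0, 3.0], [4.0, -5.0, 6.0]
print(dot_from_distance(xi, xj))           # agrees with the direct dot product

norm_i = math.sqrt(sum(a * a for a in xi))
norm_j = math.sqrt(sum(b * b for b in xj))
z = norm_i ** 2 + norm_j ** 2
t = 1e-3
print(ancilla_probability(norm_i, norm_j, t), 0.5 * z * t * t)
```

The second line of output shows the two probabilities agreeing to several digits, so counting how often the measurement succeeds gives an estimate of Z.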

If we have $|\psi\rangle$ and $|\phi\rangle$, we perform a swap test on the ancilla alone. A swap test
is a sequence of a Hadamard gate, a Fredkin gate, and another Hadamard gate, which
checks the equivalence of two states using an ancilla state (Buhrman et al., 2001). The
circuit is shown in Figure 10.2.

The Fredkin gate swaps $|\psi\rangle$ and $|\phi\rangle$ if the ancilla bit is $|1\rangle$. The Hadamard
transformation is a one-qubit rotation mapping the qubit states $|0\rangle$ and $|1\rangle$ to
two superposition states. More formally, the overall state after the first Hadamard
transformation is

$$|a\rangle|\psi\rangle|\phi\rangle = \frac{1}{\sqrt{2}} (|0\rangle + |1\rangle)|\psi\rangle|\phi\rangle \tag{10.4}$$

$$= \frac{1}{\sqrt{2}} (|0\rangle|\psi\rangle|\phi\rangle + |1\rangle|\psi\rangle|\phi\rangle). \tag{10.5}$$

We apply the Fredkin gate:

$$\frac{1}{\sqrt{2}} (|0\rangle|\psi\rangle|\phi\rangle + |1\rangle|\phi\rangle|\psi\rangle). \tag{10.6}$$

Then, we apply the Hadamard gate on the first qubit:

$$\frac{1}{2} \left[(|0\rangle + |1\rangle)|\psi\rangle|\phi\rangle + (|0\rangle - |1\rangle)|\phi\rangle|\psi\rangle\right]. \tag{10.7}$$


Figure 10.2 Quantum circuit of a swap test.

We rearrange the components for $|0\rangle$ and $|1\rangle$:

$$\frac{1}{2} \left[|0\rangle(|\psi\rangle|\phi\rangle + |\phi\rangle|\psi\rangle) + |1\rangle(|\psi\rangle|\phi\rangle - |\phi\rangle|\psi\rangle)\right]. \tag{10.8}$$


With this algebraic manipulation, we conclude that if $|\psi\rangle$ and $|\phi\rangle$ are equal, then the
measurement in the end will always give us zero.
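
The three gates of the swap test can be followed step by step on explicit amplitude vectors. The sketch below is a generic classical simulation for two pure states of equal dimension, not the ancilla-only variant used in the algorithm above; the name `swap_test_p0` is hypothetical. For normalized inputs the result should equal ½ + ½|⟨ψ|φ⟩|².

```python
import math


def swap_test_p0(psi, phi):
    """Probability of measuring 0 on the ancilla of a swap test."""
    d = len(psi)
    s = 1.0 / math.sqrt(2.0)
    # After the first Hadamard: (|0>|psi>|phi> + |1>|psi>|phi>) / sqrt(2).
    branch0 = [[s * a * b for b in phi] for a in psi]
    branch1 = [[s * a * b for b in phi] for a in psi]
    # Fredkin gate: exchange the two registers in the ancilla-1 branch.
    branch1 = [[branch1[j][i] for j in range(d)] for i in range(d)]
    # Second Hadamard: the |0> component is the normalized sum of both branches.
    amp0 = [[s * (branch0[i][j] + branch1[i][j]) for j in range(d)]
            for i in range(d)]
    return sum(abs(a) ** 2 for row in amp0 for a in row)


s = 1.0 / math.sqrt(2.0)
print(swap_test_p0([s, s], [s, s]))        # identical states
print(swap_test_p0([1.0, 0.0], [0.0, 1.0]))  # orthogonal states
print(swap_test_p0([1.0, 0.0], [s, s]))    # overlap in between
```

Identical states give a probability of one, in line with the conclusion above, while orthogonal states give one half, so repeated measurements estimate the overlap of the two states.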

With the QRAM accesses and the estimations, the overall complexity of evaluating
a single dot product $x_i^\top x_j$ is $O(\epsilon^{-1} \log N)$. Calculating the kernel matrix is
straightforward. As the inner product in this formulation derives from the Euclidean distance,
the kernel matrix is also easily calculated with this distance function.

Further generalization to nonlinear metrics approximates the distance function
by $q$th-order polynomials, where we are given $q$ copies of $|x_i\rangle$ and $|x_j\rangle$ (Harrow
et al., 2009; Lloyd et al., 2013a). Using quantum counting for the $q$ copies, we
evaluate $\langle x_i|\langle x_j| L |x_i\rangle|x_j\rangle$ for an arbitrary Hermitian operator $L$ to get the
approximation. Measuring the expectation value of $L$ to accuracy $\epsilon$ has $O(\epsilon^{-1} q \log N)$
complexity.

A similar scheme uses pretty good measurements, which are a form of
positive operator-valued measure, to estimate the inner product and the kernel
matrix (Gambs, 2008). This variant requires $\Theta(\epsilon^{-1} N)$ copies of each state.


10.3 Quantum Principal Component Analysis 


Principal component analysis relies on the eigendecomposition of a matrix, which, in
a quantum context, translates to simulating a Hamiltonian. The problem of simulating
a Hamiltonian is as follows. Given a Hamiltonian of the form $H = \sum_{j=1}^{m} H_j$, the task
is to simulate the evolution $e^{-iHt}$ by a sequence of exponentials $e^{-iH_j t}$, requiring that the
maximum error in the final state does not exceed some $\epsilon > 0$. We want to determine
an upper bound on the number of exponentials required in the sequence (Berry et al.,
2007).

The idea of approximating by a sequence of exponentials is based on the
Baker-Campbell-Hausdorff formula, which states that


$$e^Z = e^X e^Y, \tag{10.9}$$


where $Z = \log(e^X e^Y)$, and $X$ and $Y$ are noncommuting variables. A consequence of
this is the Lie product formula

$$e^{X+Y} = \lim_{n \to \infty} \left(e^{X/n} e^{Y/n}\right)^n. \tag{10.10}$$

Thus, to simulate the Hamiltonian, we can calculate

$$e^{-iHt} \approx \left(e^{-iH_1 t/n} \cdots e^{-iH_m t/n}\right)^n. \tag{10.11}$$

If the component Hamiltonians $H_j$ act only locally on a subset of variables, calculating
$e^{-iH_j t}$ is more efficient. If the dimension of the Hilbert space that $H_j$ acts on is $d_j$, then
the number of operations to simulate $e^{-iH_j t}$ is approximately $d_j^2$ (Lloyd, 1996). If the
desired accuracy is $\epsilon$, it will be bounded by

$$\epsilon \approx n \left(\sum_{j=1}^{m} d_j^2\right) \leq n m d^2, \tag{10.12}$$

where $d = \max_j d_j$. This is the error term for the component Hamiltonians, and
adjustment of $\epsilon/(nmd^2)$ is necessary to have an overall error of $\epsilon$. Thus, if the
Hamiltonians are local, to simulate the system, the time complexity is linear.

Let us consider arbitrary, but sparse Hamiltonians, that is, systems where the
interactions of the components may not be local. A Hamiltonian $H$ acting on $n$
qubits is sparse if it has at most a constant number of nonzero entries in each row
or column, and, furthermore, $\|H\|$ is bounded by a constant. In this case, we can select
an arbitrary positive integer $k$ such that the simulation of the Hamiltonian requires
$O((\log^* n)\, t^{1+1/2k})$ accesses to the matrix entries of $H$ (Berry et al., 2007). Here $\log^* n$
is the iterated logarithm:

$$\log^* n = \begin{cases} 0 & \text{if } n \leq 1, \\ 1 + \log^*(\log n) & \text{if } n > 1, \end{cases} \tag{10.13}$$

that is, the number of times the logarithm function must be applied recursively before
the result is less than or equal to 1. While the complexity is close to linear, sublinear
scaling of the sparse simulation is not feasible (Berry et al., 2007).
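
The iterated logarithm grows extremely slowly, which a direct implementation makes tangible. The sketch below assumes base-2 logarithms, a common convention that the text does not state explicitly; `iterated_log` is a hypothetical name.

```python
import math


def iterated_log(n):
    """Number of times log2 must be applied before the value is at most 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count


for n in (1, 2, 4, 16, 65536, 2 ** 65536):
    print(iterated_log(n))
```

Even for an argument with nearly twenty thousand decimal digits the function returns only five, so the log* n factor in the simulation cost is constant for all practical purposes.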

To simulate nonsparse, arbitrary $d$-dimensional Hamiltonians, the typical method
is the higher-order Trotter-Suzuki expansion, which requires $O(d \log d)$ time for a
component Hamiltonian (Wiebe et al., 2010). This can be reduced to just $O(\log d)$
by applying a density matrix $\rho$ as a Hamiltonian on another density matrix $\sigma$ (Lloyd
et al., 2013b). Using the partial trace over the first variable and the swap operator $S$,
we get

$$\mathrm{tr}_P\left(e^{-iS\Delta t}\, \rho \otimes \sigma\, e^{iS\Delta t}\right) = (\cos^2 \Delta t)\sigma + (\sin^2 \Delta t)\rho - i \sin \Delta t \cos \Delta t\, [\rho, \sigma]$$

$$= \sigma - i\Delta t [\rho, \sigma] + O(\Delta t^2). \tag{10.14}$$


Since the swap operator $S$ is sparse, it can be performed efficiently. Provided we
have $n$ examples of $\rho$, we repeat swap operations on $\rho \otimes \sigma$ to construct the unitary
operator $e^{-i\rho n \Delta t}$, and thus we simulate the unitary time evolution $e^{-i\rho n \Delta t} \sigma e^{i\rho n \Delta t}$. To
simulate $e^{-i\rho t}$ to accuracy $\epsilon$, we need $O(t^2 \epsilon^{-1} \|\rho - \sigma\|^2) \leq O(t^2 \epsilon^{-1})$ steps, where
$t = n\Delta t$ and $\|\cdot\|$ is the supremum norm.

To retrieve $\rho$, we use a QRAM with $O(\log d)$ operations. Using $O(t^2 \epsilon^{-1})$ copies of
$\rho$, we are able to implement $e^{-i\rho t}$ to accuracy $\epsilon$ in time $O(t^2 \epsilon^{-1} \log d)$.

As the last step, we perform a quantum phase estimation algorithm using $e^{-i\rho t}$. For
varying times $t$, we apply this operator on $\rho$ itself, which results in the state


$$\sum_i r_i |\chi_i\rangle\langle\chi_i| \otimes |\tilde{r}_i\rangle\langle\tilde{r}_i|, \tag{10.15}$$


where $|\chi_i\rangle$ are the eigenvectors of $\rho$, and $\tilde{r}_i$ are estimates of the matching eigenvalues.


10.4 Toward Quantum Manifold Embedding


While quantum principal component analysis is attractive, more complex low- 
dimensional manifolds are also important for embedding high-dimensional data 
(Section 5.2). Isomap is an example that relies on local density: it approximates the 
geodesic distance between two points by the length of the shortest path between these 
two points on a neighborhood graph, then applies multidimensional scaling on the 
Gram matrix. 

Constructing the neighborhood graph relies on finding the $c$ smallest values of the
distance function with high probability, using Grover’s search. The overall complexity
for finding the smallest values is $O(\sqrt{cN})$. This is equivalent to finding the smallest
distances between a fixed point and the rest of the points in a set, that is, the neighbors
of a given point (Aïmeur et al., 2013). Since there are a total of $N$ points, the overall
complexity of constructing the neighborhood graph is $O(N\sqrt{cN})$.

While calculating the geodesic distances is computationally demanding, the eigen- 
decomposition of an N x N matrix in the multidimensional scaling is the bottleneck. 
The Gram matrix is always dense, and the same eigendecomposition applies as for 
quantum principal component analysis. Thus, for this step alone, an exponential 
speedup is feasible, provided that the input and output are quantum states. 


10.5 Quantum K-Means 


Most quantum clustering algorithms are based on Grover’s search (Aïmeur et al.,
2013). These clustering algorithms offer a speedup compared with their classical
counterparts, but they do not improve the quality of the resulting clustering process. 
This is based on the belief that if finding the optimal solution for a clustering problem 
is NP-hard, then quantum computers would also be unable to solve the problem 
exactly in polynomial time (Bennett et al., 1997). If we use QRAM, an exponential 
speedup is possible. 

The simplest quantum version of K-means clustering calculates centroids and 
assigns vectors to the closest centroids, like the classical variant, but using Grover’s 
search to find the closest ones (Lloyd et al., 2013a). Since every vector is tested in 
each step, the complexity is O(N log(Nd)). 

Further improvement is possible if we allow the output to be quantum. Every 
algorithm that returns the cluster assignment for each N output must have at least 
O(N) complexity. By allowing the result to be a quantum superposition, we remove 
this constraint. The output is a quantum state: 


Ix) = (/VN) > lel) = G/VN) ¥2 edb). (10.16) 
J c jec 
If necessary, state tomography on this reveals the clustering structure in classical
terms. Constructing this state to $\epsilon$ accuracy has $O(\epsilon^{-1} K \log(KNd))$ complexity if we
use the adiabatic theorem. If the clusters are well separated, the complexity is even
less, $O(\epsilon^{-1} \log(KNd))$, as the adiabatic gap is constant (Section 4.3).

To construct the state in Equation 10.16, we start by selecting K vectors with labels
$i_c$ as initial seeds. We perform the first step of assigning the rest of the vectors to these
clusters through the adiabatic theorem. We use the state

$$\frac{1}{\sqrt{NK}} \sum_{c',j} |c'\rangle|j\rangle \left( \frac{1}{\sqrt{K}} \sum_c |c\rangle|i_c\rangle \right)^{\otimes D}, \tag{10.17}$$

where the D copies of the state $(1/\sqrt{K}) \sum_c |c\rangle|i_c\rangle$ enable us to evaluate the distances
$|x_j - x_{i_{c'}}|^2$ with the algorithm outlined in Section 10.2. The result is the initial
clustering

$$|\psi_1\rangle = \frac{1}{\sqrt{N}} \sum_c \sum_{j \in c} |c\rangle|j\rangle, \tag{10.18}$$

where the states $|j\rangle$ are associated with the cluster c with the closest seed vector $x_{i_c}$.
Assume that we have D copies of this state. Then, we can construct the individual
clusters as $|\phi_1^c\rangle = (1/\sqrt{M_c}) \sum_{k \in c} |k\rangle$, and thus estimate the number of states $M_c$ in the
cluster c.
With D copies of $|\psi_1\rangle$, together with the clusters $|\phi_1^c\rangle$, we are able to evaluate the
average distance between $x_j$ and the mean of the individual clusters:

$$\left| x_j - \frac{1}{M_c} \sum_{k \in c} x_k \right|^2 = |x_j - \bar{x}_c|^2. \tag{10.19}$$


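Next, we apply a phase to each component $|c'\rangle|j\rangle$ of the superposition with the
following Hamiltonian:

$$H_f = \sum_{c',j} |x_j - \bar{x}_{c'}|^2 \, |c'\rangle\langle c'| \otimes |j\rangle\langle j| \otimes I^{\otimes D}. \tag{10.20}$$

We start the adiabatic evolution on the state $(1/\sqrt{NK}) \sum_{c',j} |c'\rangle|j\rangle|\psi_1\rangle^{\otimes D}$. The base
Hamiltonian is $H_b = I - |\phi\rangle\langle\phi|$, where $|\phi\rangle$ is the superposition of cluster centroids.
The final state is

$$\left( \frac{1}{\sqrt{N}} \sum_{c'} \sum_{j \in c'} |c'\rangle|j\rangle \right) |\psi_1\rangle^{\otimes D} = |\psi_2\rangle|\psi_1\rangle^{\otimes D}. \tag{10.21}$$

Repeating this D times, we create D copies of the updated state $|\psi_2\rangle$. We repeat
this cluster assignment step in subsequent iterations, eventually converging to the
superposition $|\chi\rangle$ in Equation 10.16.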


10.6 Quantum K-Medians 


In K-means clustering, the centroid may lie outside the manifold in which the
points are located. A flavor of this family of algorithms, K-medians, bypasses this
problem by always choosing an element in a cluster to be the center. We achieve
this by constructing a routine for finding the median in a cluster using Grover’s
search, and then iteratively calling this routine to reassign elements (Aimeur et al.,
2013).

Assume that we establish the distance between vectors either with the help of an
oracle or with a method described in Section 10.2. To calculate the total distance from a
point $x_i$ to all other points, $\sum_j d(x_i, x_j)$, we repeatedly call the distance calculator.
Applying the modified Grover’s search of Durr and Hoyer (1996) to find the minimum,
we identify the median among a set of points in $O(N\sqrt{N})$ time.

We call this calculation of the median in each iteration of the quantum K-medians,
and reassign the data points accordingly. The procedure is outlined in Algorithm 4.


ALGORITHM 4 Quantum K-medians
Require: Initial K points
Ensure: Clusters and their medians
repeat
    for all $x_i$ do
        Attach to the closest center
    end for
    for all K clusters do
        Calculate median for cluster
    end for
until Clusters stabilize
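
The control flow of Algorithm 4 can be illustrated with a purely classical sketch. This is a minimal illustration under stated assumptions, not the quantum routine itself: the minimum-finding step of Durr and Hoyer is replaced by an ordinary `argmin`, distances are Euclidean, and the function names `cluster_median` and `k_medians` are hypothetical.

```python
import numpy as np

def cluster_median(points):
    """Index of the point with the smallest summed distance to all others.

    Classical stand-in (an assumption of this sketch) for the
    Grover-based minimum search used in the quantum algorithm.
    """
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return int(np.argmin(dists.sum(axis=1)))

def k_medians(points, initial_centers, max_iter=100):
    """Alternate assignment and median updates until the centers stabilize."""
    points = np.asarray(points, dtype=float)
    centers = np.array(initial_centers, dtype=int)  # indices into `points`
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # Attach every point to the closest center.
        dists = np.linalg.norm(points[:, None, :] - points[centers][None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Recalculate the median of every cluster.
        new_centers = centers.copy()
        for c in range(len(centers)):
            members = np.flatnonzero(labels == c)
            if members.size:
                new_centers[c] = members[cluster_median(points[members])]
        if np.array_equal(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

data = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centers = k_medians(data, initial_centers=[0, 1])
```

Because every center is an index into the data set, the returned medians are always actual data points, which is the property that distinguishes K-medians from K-means in this section.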


10.7 Quantum Hierarchical Clustering 


Quantum hierarchical clustering hinges on ideas similar to those of quantum
K-medians clustering. Instead of finding the median, we use a quantum algorithm to
calculate the maximum distance between two points in a set. We iteratively call 
this algorithm to split clusters and reassign the data instances to the most distant 
pair of instances (Aimeur et al., 2013). This is the divisive form of hierarchical 
clustering. 

As in Section 10.6, finding the maximum distance between a set of points relies on
modifying the algorithm of Durr and Hoyer (1996). We initialize a maximum distance,
$d_{\max}$, as zero. Then, we repeatedly call Grover’s search to find indices i and j, such
that $d(x_i, x_j) > d_{\max}$, and update the value of $d_{\max}$. The iterations stop when there are
no new i, j index pairs.

Using the last pair, we attach every instance in the current cluster to a new cluster
$C_i$ or $C_j$, depending on whether the instance is closer to $x_i$ or $x_j$. Algorithm 5 outlines
the process.

The same problems remain with the quantum variant of divisive clustering as with
the classical one. It works if the clusters separate well and they are balanced. Outliers
or a blend of small and large clusters will skew the result.


ALGORITHM 5 Quantum divisive clustering
Require: A cluster C
Ensure: Further clusters in C
if Maximum depth reached then
    return C
end if
Find most distant points $x_i$ and $x_j$ in C
for all $x \in C$ do
    Assign x to a new cluster $C_i$ or $C_j$ based on proximity to $x_i$ or $x_j$
end for
Recursive call on $C_i$ and $C_j$
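
The recursion in Algorithm 5 can be sketched classically as follows. The sketch assumes Euclidean distances and replaces the Grover-based search for the most distant pair with an exhaustive `argmax` over all pairs; the names `most_distant_pair` and `divisive_clustering` are hypothetical.

```python
import numpy as np

def most_distant_pair(points):
    """Exhaustive search for the most distant pair.

    Classical stand-in (an assumption of this sketch) for the
    repeated Grover search described in the text.
    """
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    i, j = np.unravel_index(np.argmax(dists), dists.shape)
    return int(i), int(j)

def divisive_clustering(points, indices=None, depth=0, max_depth=1):
    """Recursively split a cluster around its two most distant members."""
    points = np.asarray(points, dtype=float)
    if indices is None:
        indices = np.arange(len(points))
    if depth >= max_depth or len(indices) < 2:
        return [indices.tolist()]
    i, j = most_distant_pair(points[indices])
    d_i = np.linalg.norm(points[indices] - points[indices[i]], axis=1)
    d_j = np.linalg.norm(points[indices] - points[indices[j]], axis=1)
    closer_to_i = d_i <= d_j
    left, right = indices[closer_to_i], indices[~closer_to_i]
    if len(right) == 0:  # all points coincide; nothing to split
        return [indices.tolist()]
    return (divisive_clustering(points, left, depth + 1, max_depth)
            + divisive_clustering(points, right, depth + 1, max_depth))

data = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
clusters = divisive_clustering(data, max_depth=1)
```

The sketch also makes the weakness noted above easy to reproduce: a single outlier becomes one end of the most distant pair and ends up in a cluster of its own.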


10.8 Computational Complexity 


The great speedups in quantum unsupervised learning pivot on quantum input and
output states. Quantum principal component analysis with QRAM, if we take some unit
time for the evolution of the system, has a complexity of $O(\epsilon^{-1} \log d)$, where $\epsilon$
is the desired accuracy; this is an exponential speedup over any known classical
algorithm.

The quantum K-means algorithm with quantum input and output states has a
complexity of $O(\log Nd)$, which is an exponential speedup over the polynomial
complexity of classical algorithms. Quantum K-medians has $O(tN^{3/2}/\sqrt{K})$ complexity as
opposed to $O(tN^2/K)$ of the classical algorithm, where t is the number of iterations.

Quantum divisive clustering, $O(N \log N)$, has much lower complexity compared with
the classical limit $\Theta(N^2)$. We must point out, however, that classical sublinear
clustering in a probably approximately correct setting exists, using statistical sampling.
This method reduces the complexity to $O(\log^2 N)$ (Mishra et al., 2001). Sublinear
clustering algorithms are not exclusive to quantum systems.


Quantum Pattern Recognition


Artificial neural networks come in countless variants, and it is not surprising that 
numerous attempts aim at introducing quantum behavior into these models. While 
neurons and connections find analogues in quantum systems, we can build models that 
mimic classical neural networks, but without using the neural metaphor. For instance, 
a simple scheme to train the quantum variant of Hopfield networks is to store patterns 
in a quantum superposition; this is the model of quantum associative memories. To 
match a pattern, we use a modified version of Grover’s search. Storage capacity is 
exponentially larger than in classical Hopfield networks (Section 11.1). 

Other models do include references to classical neural networks. Neurocomputing, 
however, is dependent on nonlinear components, but evolution in quantum systems is 
linear. Measurement is a key ingredient in introducing nonlinearity, as the quantum 
perceptron shows (Section 11.2). 

Apart from measurement, decoherence paves the way to nonlinear patterns, and 
Feynman’s path integral formulation is also able to introduce nonlinearity. Feedfor- 
ward networks containing layers and multiple neurons may mix quantum and classical 
components, introducing various forms of nonlinear behavior (Section 11.3). 

Multiple attempts show the physical feasibility of implementing quantum neural 
networks. Nuclear magnetic resonance and quantum dot experiments are promising, 
albeit using only a few qubits. There are suggestions for optical and adiabatic 
implementations (Section 11.4). 

While we are far from implementing deep learning networks with quantum 
components, the computational complexity is, not surprisingly, much better than with 
classical algorithms (Section 11.5). 


11.1 Quantum Associative Memory 


A quantum associative memory is analogous to a Hopfield network (Section 6.2), 
although the quantum formulation does not require a reference to artificial neurons. 
A random access memory retrieves information using addresses (Section 10.1), 
whereas recall from an associative memory requires an input pattern instead of an 
address. The pattern may be a complete match to one of the patterns stored in the 
associative memory, but it can also be a partial pattern. The task is to reproduce the
closest match.

If the dimension of a data instance is d, and we are to store N patterns, a classical
Hopfield network would require d neurons and the number of patterns that can be
stored is linear in d (Section 6.2). The quantum associative memory stores patterns in
a superposition, offering a storage capacity of $O(2^d)$, using only d neurons (Ventura
and Martinez, 2000). Here we follow the exposition in Trugenberger (2001), which
extends and simplifies the original formulation by Ventura and Martinez (2000).

The associative memory is thus a superposition with equal probability for all of its
entangled qubits:

$$|M\rangle = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} |x_i\rangle, \tag{11.1}$$

where each $|x_i\rangle$ state has d qubits. The first task is to construct the superposition $|M\rangle$.
We use auxiliary registers to do so.

The first register x of d qubits will temporarily hold the subsequent data instances
$x_i$. The second register is a two-qubit utility register u. The third register is the memory
m of d qubits. With these registers, the initial quantum state is

$$|\psi_0^1\rangle = |x_{11}, \ldots, x_{1d}; 01; 0, \ldots, 0\rangle. \tag{11.2}$$

The upper index of $|\psi_0^1\rangle$ is the current data instance being processed, and the lower
index is the current step in storing the data instance. There will be a total of seven
steps.

We separate this state into two terms, one corresponding to the patterns already
in the memory, and the other processing a new pattern. The state of the second utility
qubit $u_2$ distinguishes the two parts: $|0\rangle$ for the stored patterns, and $|1\rangle$ for the part processed.

To store $x_i$, we copy the pattern into the memory register with

$$|\psi_1^i\rangle = \prod_{j=1}^{d} 2\mathrm{XOR}_{x_{ij} u_2 m_j} |\psi_0^i\rangle. \tag{11.3}$$

Here 2XOR is a Toffoli gate (Section 4.2), and the lower index indicates which qubits
it operates on.

Then we flip all qubits of the memory register to $|1\rangle$ if the contents of the pattern
and the memory registers are identical, which is true only for the processing term:

$$|\psi_2^i\rangle = \prod_{j=1}^{d} \mathrm{NOT}_{m_j} \mathrm{XOR}_{x_{ij} m_j} |\psi_1^i\rangle. \tag{11.4}$$


The third operation changes the first utility qubit to $|1\rangle$:

$$|\psi_3^i\rangle = n\mathrm{XOR}_{m_1 \ldots m_d u_1} |\psi_2^i\rangle. \tag{11.5}$$

The fourth step uses a two-qubit gate:

$$CS^i = \mathrm{diag}(I, S^i), \tag{11.6}$$

where

$$S^i = \begin{pmatrix} \sqrt{\frac{i-1}{i}} & \frac{1}{\sqrt{i}} \\ -\frac{1}{\sqrt{i}} & \sqrt{\frac{i-1}{i}} \end{pmatrix}. \tag{11.7}$$

With this gate, we separate the new pattern to be stored, together with the correct
normalization factor:

$$|\psi_4^i\rangle = CS^{N+1-i}_{u_1 u_2} |\psi_3^i\rangle. \tag{11.8}$$


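We perform the reverse operations of Equations 11.4 and 11.5, first restoring the utility
qubit $u_1$,

$$|\psi_5^i\rangle = n\mathrm{XOR}_{m_1 \ldots m_d u_1} |\psi_4^i\rangle, \tag{11.9}$$

and then restoring the memory register m to its original values,

$$|\psi_6^i\rangle = \prod_{j=d}^{1} \mathrm{XOR}_{x_{ij} m_j} \mathrm{NOT}_{m_j} |\psi_5^i\rangle. \tag{11.10}$$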

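After this step, we have the following state:

$$|\psi_6^i\rangle = \frac{1}{\sqrt{N}} \sum_{k=1}^{i} |x_i; 00; x_k\rangle + \sqrt{\frac{N-i}{N}}\, |x_i; 01; x_i\rangle. \tag{11.11}$$

In the last step, we restore the third register in the second term of Equation 11.11
to $|0\rangle$:

$$|\psi_7^i\rangle = \prod_{j=d}^{1} 2\mathrm{XOR}_{x_{ij} u_2 m_j} |\psi_6^i\rangle. \tag{11.12}$$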
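
The role of the normalization in the fourth step can be checked numerically. The sketch below tracks only two amplitudes, that of the term being split off into the memory and that of the remaining processing term, and applies the rotation $S^{N+1-i}$ of Equation 11.7 for each pattern in turn. It assumes the matrix form of $S^i$ given above; the function names `s_gate` and `storage_amplitudes` are hypothetical.

```python
import numpy as np

def s_gate(j):
    """The 2x2 rotation S^j, in the form assumed in Equation 11.7."""
    return np.array([
        [np.sqrt((j - 1) / j), 1 / np.sqrt(j)],
        [-1 / np.sqrt(j), np.sqrt((j - 1) / j)],
    ])

def storage_amplitudes(n_patterns):
    """Amplitude given to each stored pattern, and what is left over.

    The state vector is reduced to (stored, processing) amplitudes on
    the second utility qubit; everything else in the registers is ignored.
    """
    processing = 1.0  # before the first pattern, all weight is in the processing term
    stored = []
    for i in range(1, n_patterns + 1):
        gate = s_gate(n_patterns + 1 - i)
        new_stored, processing = gate @ np.array([0.0, processing])
        stored.append(new_stored)
    return np.array(stored), processing

stored, leftover = storage_amplitudes(5)
```

Under these assumptions every pattern receives the same amplitude $1/\sqrt{N}$, and the processing term vanishes after the last pattern, in agreement with Equations 11.1 and 11.11.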


A second look at the superposition state |M) in Equation 11.1 highlights an 
important characteristic: the stored states have equal weights in the superposition. 
Grover’s algorithm works on such uniformly weighted superpositions (Equation 4.25). 
Yet, if we wish to retrieve a state given an input, the original Grover’s algorithm 
as outlined in Section 4.5 will not work efficiently: the probability of retrieving the 
correct item will be low. Furthermore, Grover’s algorithm assumes that all patterns of 
a given length of qubits are stored, which is hardly the case in pattern recognition. 

We modify Algorithm 1 to overcome this problem (Ventura and Martinez, 2000).
The modified variant is outlined in Algorithm 6. Zhou and Ding (2008) suggested a
similar improvement.

The second state rotation operator rotates not only the phases of the desired states,
but also the phases of all the stored patterns. The states not matching
the target pattern are present as noise, and the superposition created by applying
Hadamard transforms also introduces undesirable states. The extra rotations in the
modified algorithm force these two different kinds of nondesired states to have the
same phase, rather than opposite phases as in the original algorithm. The eventual
state then serves as the input to the normal loop of Grover’s algorithm.

ALGORITHM 6 Grover’s algorithm modified for pattern recognition
Require: Initial state in equal superposition $|M\rangle = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} |x_i\rangle$; the
    input state to be matched through a corresponding oracle O.
Ensure: Most similar element retrieved.
Apply the Grover operator on the memory state: $|M\rangle = G|M\rangle$.
Apply an oracle operator for the entire stored memory state:
    $|M\rangle = O_M|M\rangle$.
Apply the rest of the Grover operator: $|M\rangle = H^{\otimes n}(2|0\rangle\langle 0| - I)H^{\otimes n}|M\rangle$.
for $O(\sqrt{N})$ times do
    Apply the Grover operator G
end for
Measure the system.


A similar strategy is used in quantum reinforcement learning. Distinct from 
supervised learning, input and output pairs are associated with a reward, and the 
eventual goal is to take actions that maximize the reward. Actions and states are 
stored in a superposition, and Grover’s search amplifies solutions that have a high 
reward (Dong et al., 2008). 

To find a matching pattern in the quantum associative memory, we perform a
probabilistic cloning of the memory state $|M\rangle$; otherwise, we would lose the memory state
after retrieving a single item (Duan and Guo, 1998). The number of elements stored
can be exponential, which makes the time complexity of Grover’s algorithm high.

An alternative method splits retrieval into two steps: identification and recognition
(Trugenberger, 2001). We must still have a probabilistic clone of the memory
state. We use three registers in the retrieval process: the first register contains
the input pattern, the second register stores $|M\rangle$, and there is a control register
c initialized to the state $(|0\rangle + |1\rangle)/\sqrt{2}$. The initial state for the retrieval process
is thus

$$|\psi_0\rangle = \frac{1}{\sqrt{2N}} \sum_{k=1}^{N} |i_1, \ldots, i_d; x_{k1}, \ldots, x_{kd}; 0\rangle + \frac{1}{\sqrt{2N}} \sum_{k=1}^{N} |i_1, \ldots, i_d; x_{k1}, \ldots, x_{kd}; 1\rangle. \tag{11.13}$$

We flip the memory registers to $|1\rangle$ if $i_j$ and $x_{kj}$ are identical:

$$|\psi_1\rangle = \prod_{j=1}^{d} \mathrm{NOT}_{m_j} \mathrm{XOR}_{i_j m_j} |\psi_0\rangle. \tag{11.14}$$


We want to calculate the Hamming distance (Equation 2.3) between the input
pattern and the instances in the memory. To achieve this, we use the following
Hamiltonian:

$$H = d_m \otimes (\sigma_z)_c, \qquad d_m = \sum_{j=1}^{d} \left( \frac{\sigma_z + 1}{2} \right)_{m_j},$$

where $\sigma_z$ is the third Pauli matrix (Equation 4.3). This Hamiltonian measures the
number of 0’s in register m with a plus sign if c is in $|0\rangle$, and the number with a minus
sign if c is in $|1\rangle$.

The terms of the superposition in $|\psi_1\rangle$ are eigenstates of H. Applying the
corresponding unitary evolution (Equation 3.41), we get

$$|\psi_2\rangle = \frac{1}{\sqrt{2N}} \sum_{k=1}^{N} e^{\mathrm{i}\frac{\pi}{2d} d(i, x_k)} |i_1, \ldots, i_d; b_{k1}, \ldots, b_{kd}; 0\rangle + \frac{1}{\sqrt{2N}} \sum_{k=1}^{N} e^{-\mathrm{i}\frac{\pi}{2d} d(i, x_k)} |i_1, \ldots, i_d; b_{k1}, \ldots, b_{kd}; 1\rangle, \tag{11.15}$$

where $b_{kj} = 1$ if and only if $i_j = x_{kj}$.

As the last step, we perform the inverse operation of Equation 11.14:

$$|\psi_3\rangle = H_c \prod_{j=d}^{1} \mathrm{XOR}_{i_j m_j} \mathrm{NOT}_{m_j} |\psi_2\rangle, \tag{11.16}$$

where $H_c$ is the Hadamard gate on the control qubit. We measure the system in the
control qubit, obtaining the probability distribution:


$$P(|c\rangle = |0\rangle) = \frac{1}{N} \sum_{k=1}^{N} \cos^2\left( \frac{\pi}{2d} d(i, x_k) \right), \tag{11.17}$$

$$P(|c\rangle = |1\rangle) = \frac{1}{N} \sum_{k=1}^{N} \sin^2\left( \frac{\pi}{2d} d(i, x_k) \right). \tag{11.18}$$

Thus, if the input pattern is substantially different from every stored pattern, the
probability of measuring $|c\rangle = |1\rangle$ is higher. If the input pattern is close to every stored
pattern, the probability of measuring $|c\rangle = |0\rangle$ will be higher. Repeating the algorithm
gives an improved estimate.

Once a pattern has been recognized, we proceed to measure the memory register to
identify the closest state. The probability of obtaining a pattern $x_k$ is

$$P(x_k) = \frac{1}{N P(|c\rangle = |0\rangle)} \cos^2\left( \frac{\pi}{2d} d(i, x_k) \right). \tag{11.19}$$

The probability peaks around the patterns which have the smallest Hamming distance
to the input.
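
The measurement statistics of Equations 11.17 to 11.19 are simple enough to evaluate classically for a small memory. The following sketch assumes binary patterns stored as rows of an array; the function name `retrieval_probabilities` is hypothetical, and the code only evaluates the closed-form probabilities rather than simulating the circuit.

```python
import numpy as np

def retrieval_probabilities(patterns, query):
    """Evaluate P(c=0), P(c=1), and the distribution over stored patterns."""
    patterns = np.asarray(patterns)
    query = np.asarray(query)
    n, d = patterns.shape
    # Hamming distance between the query and every stored pattern.
    hamming = np.count_nonzero(patterns != query, axis=1)
    cos2 = np.cos(np.pi * hamming / (2 * d)) ** 2
    p_c0 = cos2.sum() / n              # Equation 11.17
    p_c1 = 1.0 - p_c0                  # Equation 11.18
    # Equation 11.19; undefined when p_c0 is zero, that is, when every
    # stored pattern is the bitwise complement of the query.
    p_pattern = cos2 / (n * p_c0)
    return p_c0, p_c1, p_pattern

memory = [[0, 0, 0, 0], [1, 1, 1, 1]]
p_c0, p_c1, p_pattern = retrieval_probabilities(memory, [0, 0, 0, 0])
```

With one exact match and one pattern at maximal distance, the control qubit is equally likely to yield either outcome, while the distribution over patterns is concentrated on the exact match.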

Table 11.1 provides a high-level comparison of classical Hopfield networks and 
quantum associative memories. 

The recognition efficiency relies on comparing cosines and sines of the same
distances in the distribution, whereas the identification efficiency relies on comparing
cosines of the different distances in the distribution. Identification is most efficient
when one of the distances is zero, and all others are large, making the probability peak
on a single pattern. This is the opposite of the scenario for optimal recognition efficiency,
which prefers uniformly distributed distances. Increasing the number of control qubits
from 2 to d improves the efficiency of identification (Trugenberger, 2002).

Table 11.1 Comparison of Classical Hopfield Networks
and Quantum Associative Memories (Based
on Ezhov and Ventura, 2000)

                 Classical Hopfield Network        Quantum Associative Memory

Connections      $w_{ij}$                          Entanglement $|x_{k1} \cdots x_{kd}\rangle$
Learning rule    $\sum_{k=1}^{N} x_{ki} x_{kj}$    Superposition of entangled states $\sum_{k=1}^{N} a_k |x_{k1} \cdots x_{kd}\rangle$
Winner search    $n = \arg\max_i (f_i)$            Unitary transformation $|\psi\rangle \to |\psi'\rangle$
Output result    $n$                               Decoherence $\sum_{k=1}^{N} a_k |x_k\rangle \Rightarrow |x_n\rangle$


11.2 The Quantum Perceptron 


As in the case of classical neural networks, the perceptron is the simplest
feedforward model. The quantum variant of a perceptron relies on the unitary
evolution of a density matrix (Lewenstein, 1994). The quantum perceptron takes an
input state $\rho^{in}$, and produces an output $\rho^{out} = U \rho^{in} U^\dagger$ (see Equation 3.42). The output
density matrix is then subject to measurement to introduce nonlinearity.

Given N patterns to learn, we prepare the states $\rho_i^{in}$, $i = 1, \ldots, N$, with projection
operators $P_i^{in}$. If $\rho_0$ is the density matrix, the preparation of the system consists of
stating the ith input question, that is, the measurement and renormalization

$$\rho_i^{in} = P_i^{in} \rho_0 P_i^{in} / \mathrm{tr}(P_i^{in} \rho_0 P_i^{in}). \tag{11.20}$$

The output states are defined similarly with corresponding projection operators
$P_i^{out}$. These measurements ensure a nonlinear behavior in the system.

Let us define two local cost functions on the individual data instances. Let $E_i$ denote
the conditional probability that we did not find the system in the ith output state, although it was
prepared in the ith input state:

$$E_i = \mathrm{tr}(Q_i^{out} U P_i^{in} \rho_0 P_i^{in} U^\dagger Q_i^{out}) / \mathrm{tr}(P_i^{in} \rho_0 P_i^{in}), \tag{11.21}$$

where $Q = 1 - P$. Analogously, $F_i$ is the conditional probability of finding the system
in the ith output state, whereas it was not in the ith input state:

$$F_i = \mathrm{tr}(P_i^{out} U Q_i^{in} \rho_0 Q_i^{in} U^\dagger P_i^{out}) / \mathrm{tr}(Q_i^{in} \rho_0 Q_i^{in}). \tag{11.22}$$
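
The two local cost functions are straightforward to evaluate for small matrices. The sketch below assumes the forms of Equations 11.21 and 11.22 as written above, with projectors and the unitary given as dense arrays; the function name `local_costs` is hypothetical.

```python
import numpy as np

def local_costs(U, rho0, P_in, P_out):
    """Return (E_i, F_i) for one pattern, following Equations 11.21 and 11.22."""
    identity = np.eye(rho0.shape[0])
    Q_in, Q_out = identity - P_in, identity - P_out
    U_dag = U.conj().T
    # Probability of missing the ith output although the ith input was prepared.
    E = (np.trace(Q_out @ U @ P_in @ rho0 @ P_in @ U_dag @ Q_out).real
         / np.trace(P_in @ rho0 @ P_in).real)
    # Probability of finding the ith output although the ith input was not prepared.
    F = (np.trace(P_out @ U @ Q_in @ rho0 @ Q_in @ U_dag @ P_out).real
         / np.trace(Q_in @ rho0 @ Q_in).real)
    return E, F

rho0 = np.eye(2) / 2                      # maximally mixed single-qubit state
P0 = np.array([[1.0, 0.0], [0.0, 0.0]])   # projector on |0>
pauli_x = np.array([[0.0, 1.0], [1.0, 0.0]])
perfect = local_costs(np.eye(2), rho0, P0, P0)
worst = local_costs(pauli_x, rho0, P0, P0)
```

In this single-qubit example, the identity maps the input subspace onto the output subspace and both errors vanish, whereas a bit flip sends every prepared state to the wrong subspace.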


With these two measures of error, we define a global cost function 


E= ae (11.23)

Rather than asking which unitary transformation U would minimize the cost function
in Equation 11.23, which would be similar to the task set out in Chapter 13 relating to
quantum process tomography, we are interested in restricting the set of unitaries,
and asking what is the probability of finding one given an accepted probability of
error (Gardner, 1988).

Assume that the inputs and outputs are independent, and that the $P_i^{in}$ operators are
one-dimensional. Furthermore, assume that the $P_i^{out}$ operators are D-dimensional. The
ratio D/d will determine the learning capacity of the quantum perceptron.

When the ratio D/d is too large, the constraint on the error bound on $F_i$ cannot be
satisfied, and we have a trivial model that gives the answer yes to any input; that is,
the perceptron does not perform any computation.

If $E_i$ and $F_i$ are bounded by a and a′, respectively, and D/d is between a′ and
1 − a, we always obtain a satisfactory response for any choice of U. In this case, one
of the error terms must exceed 1/2; hence, practically no information is stored in the
perceptron.

As the ratio D/d becomes smaller, we approximate a more standard form of
learning. The error cannot be arbitrarily small, as it is bounded from below by

$$D/d > \frac{3b^2}{(1 + b + \sqrt{1 + 4b}) - 3b^2}, \tag{11.24}$$

where $b = 3(1 - a)$. Finding an optimal U is possible, but Lewenstein (1994) does
not show a way of doing this. As we are not concerned with additional restrictions on
the unitary, theoretically the quantum perceptron has no limits on its storage capacity.


11.3 Quantum Neural Networks 


Classical networks have nonlinear irreversible dynamics, whereas quantum systems 
evolve in a linear, reversible way (Zak and Williams, 1998). How do we scale up 
nonlinearity from a single perceptron to a network of neurons? 

If we keep implementations in mind, quantum dots are a candidate for quantum 
neural networks (Section 11.4). This allows nonlinearity to enter through Feynman’s 
path integral formulation (Behrman et al., 1996). 

A more common approach is to alternate classical and quantum components, in which
the measurement and the quantum collapse introduce nonlinearity, as in the case of 
a single perceptron (Narayanan and Menneer, 2000; Zak and Williams, 1998). In 
fact, numerical simulations showed that a fully quantum neural network may produce 
worse results than a partly quantum one (Narayanan and Menneer, 2000). After each 
measurement, the quantum device is reset to continue from its eigenstate. To overcome 
the probabilistic nature of the measurements, several quantum devices are measured 
and reset simultaneously (Zak and Williams, 1998). 

The overall architecture of the network resembles the classical version, mixing
layers of classical and quantum components (Narayanan and Menneer, 2000). Input
nodes are replaced by slits through which quantum particles can travel. The particles
undergo some unitary evolution, and an interference pattern is drawn on a
measurement device (Figure 11.1). This interference pattern replaces weights, and this pattern
is modified by Grover’s algorithm through the training procedure.

Figure 11.1 In a feedforward quantum neural network, quantum and classical components
may mix. In this figure, the connections in the first and second layers are replaced by
interference, but the edges between the second and third layers remain classical.

The learning capacity of quantum neural networks is approximately the same 
as that of the classical variants (Gupta and Zia, 2001). Their advantage comes 
from a greater generalization performance, especially when training has low error 
tolerance, although these results are based on numerical experiments on certain data 
sets (Narayanan and Menneer, 2000). 


11.4 Physical Realizations 


Quantum dot molecules are nearby groups of atoms deposited on a host substrate.
If the dots are sufficiently close to one another, excess electrons can tunnel between the
dots, which gives rise to a dipole. This is a model for a qubit, and, since it is based on
solid-state materials, it is an attractive candidate for implementations. Quantum dots
are easy to manipulate by optical means, changing the number of excitations. This
leads to a temporal neural network: temporal in the sense that the nodes are successive time
slices of the evolution of a single quantum dot (Behrman et al., 2000). If we allow a
spatial configuration of multiple quantum dots, Hopfield networks can be trained.

Optical realizations have also been suggested. Lewenstein (1994) discussed two 
potential examples for implementing perceptrons, a d-port lossless linear optical unit, 
and a d-port nonlinear unit. In a similar vein, Altaisky (2001) mooted phase shifters 
and beam splitters for linear evolution, and light attenuators for the nonlinear case. 
A double-slit experiment is a straightforward way to implement the interference model
of feedforward networks (Narayanan and Menneer, 2000). Hopfield networks have a
holographic model implementation (Loo et al., 2004).

Adiabatic quantum computing offers a global optimum for quantum associative
memories, as opposed to the local optimization in a classical Hopfield network
(Neigovzen et al., 2009). A two-qubit implementation was demonstrated on
a liquid-state nuclear magnetic resonance system. A quantum neural network of N
bipolar states is represented by N qubits. Naturally, a superposition of “fire” and “not
fire” exists in the qubits. The Hamiltonian is given by

$$H_p = H_{mem} + \Gamma H_{inp}, \tag{11.25}$$

where $H_{mem}$ represents the knowledge of the stored pattern in the associative memory,
$H_{inp}$ represents the computational input, and $\Gamma > 0$ is an appropriate weight.

The memory Hamiltonian is defined as the coupling strengths between qubits:

$$H_{mem} = -\frac{1}{2} \sum_{i \neq j} w_{ij} \sigma_i^z \sigma_j^z, \tag{11.26}$$

where $\sigma_i^z$ is the Pauli Z matrix on qubit i, and $w_{ij}$ are the weights of the Hopfield
network.
For the retrieval Hamiltonian $H_{inp}$, it is assumed that the input pattern is of length
N. If it is not, we pad the missing states with zero. The term is defined as

$$H_{inp} = \sum_j h_j \sigma_j^z. \tag{11.27}$$

The external field defined by $H_{inp}$ creates a metric that is proportional to the
Hamming distance between the input state and the memory patterns. If the energy
of the memory Hamiltonian $H_{mem}$ is shifted, similar patterns will have lower energy
(Figure 11.2).
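
Since both Hamiltonians are diagonal in the computational basis, the ground state of the composite system can be found for a handful of qubits by enumerating spin configurations. The sketch below is an illustration under several assumptions that are not taken from the text: Hebbian weights with zero diagonal, a Hopfield energy of the form $-\frac{1}{2} \sum_{i \neq j} w_{ij} s_i s_j$, a field $h_j$ equal to minus the input bit so that agreement lowers the energy, and the hypothetical function names `hebbian_weights` and `ground_state`. It does not simulate the adiabatic evolution itself.

```python
import itertools
import numpy as np

def hebbian_weights(patterns):
    """Hebbian couplings with zero diagonal (an assumption of this sketch)."""
    patterns = np.asarray(patterns, dtype=float)
    weights = patterns.T @ patterns / patterns.shape[1]
    np.fill_diagonal(weights, 0.0)
    return weights

def ground_state(weights, field, gamma):
    """Brute-force minimum of -1/2 s^T W s + gamma * h^T s over s in {+1, -1}^N."""
    best_energy, best_state = np.inf, None
    for bits in itertools.product([1, -1], repeat=len(field)):
        s = np.array(bits, dtype=float)
        energy = -0.5 * s @ weights @ s + gamma * field @ s
        if energy < best_energy:
            best_energy, best_state = energy, s
    return best_state.astype(int), best_energy

stored = np.array([[1, 1, -1, -1], [1, -1, 1, -1]])
query = np.array([1, 1, -1, 0])   # last entry unknown, padded with zero
state, energy = ground_state(hebbian_weights(stored), field=-query, gamma=0.5)
```

In this example the memory term alone has several degenerate minima, the stored patterns and their mirror images, and the input field selects the stored pattern that agrees with the partial query.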

This scheme ignores training: it assumes that the memory superposition contains
all configurations. This is in contrast with the learning algorithm described in
Section 11.1.

Figure 11.2 In a quantum associative memory that relies on the adiabatic theorem, the patterns
are stored in the stable states of the Hamiltonian $H_{mem}$. The input pattern is represented by a
new Hamiltonian $H_{inp}$, changing the overall energy landscape, $H_{mem} + H_{inp}$. The ground state
of the composite system points to the element to be retrieved from the memory.


11.5 Computational Complexity


Quantum associative memory has a linear time complexity in the number of elements
to be folded in the superposition and the number of dimensions (O(Nd); Ventura and
Martinez, 2000). Assuming that N ≫ d, this means that computational complexity is
improved by a factor of N compared with the classical Hopfield network.

Quantum artificial neural networks are most useful where the training set is 
large (Narayanan and Menneer, 2000). Since the rate of convergence of the gradient 
descent in a classical neural network is not established for the general case, it is hard to 
estimate the overall improvement in the quantum case. The generic improvement with 
Grover’s algorithm is quadratic, and we can assume this for the quantum components 
of a quantum neural network. 


Quantum Classification 


Many techniques that made an appearance in Chapter 10 are useful in the 
supervised setting. Quantum random access memory (QRAM), the calculation of the 
kernel matrix, and the self-analysis of states provide the foundations for this chapter. 

The quantum version of nearest-neighbor classification relies either on a QRAM 
or on oracles. Apart from extending the case of the quantum K-means to labeled 
instances, an alternative variant compares the minimum distances across all elements 
in the clusters, yielding better performance (Section 12.1). 

A simple formulation of quantum support vector machines uses Grover’s search to
replace sequential minimal optimization in a discretized search space (Section 12.2).

Least-squares support vector machines translate an optimization problem into a set 
of linear equations. The linear equations require the quick calculation of the kernel 
matrix—this is one source of the speedup in the quantum version. The other source 
of the speedup is the efficient solution of the linear equations on quantum hardware. 
Relying on this formulation, and assuming quantum input and output space, quantum 
support vector machines can achieve an exponential speedup over their classical 
counterparts (Section 12.3). 

Generic computational complexity considerations are summarized at the end of the 
chapter (Section 12.4). 


12.1 Nearest Neighbors 


Nearest neighbors is a supervised cluster assignment task. Assume that there are two
clusters, and they are represented by the centroid vector $(1/M_c) \sum_{j \in c} x_j$, where c is the
class. The task is to find the class c for a new vector x such that the distance
$|x - (1/M_c) \sum_{j \in c} x_j|^2$ is minimal. We use the algorithm in Section 10.2 to estimate
this distance with a centroid vector. We construct this centroid state by querying the
QRAM (Section 10.1).

An extension of this algorithm avoids using the centroid vector, as this form of
nearest-neighbor classification performs poorly if the classes do not separate well,
or if the shape of the classes is complicated and the centroid does not lie within the
class (Wiebe et al., 2014). Instead, we can evaluate the distances between all vectors
in a class and the target vector, and the task becomes deciding whether the following
inequality is true:

$$\min_{x_j \in c_1} \|x - x_j\| \leq \min_{x_j \in c_2} \|x - x_j\|. \tag{12.1}$$

For this variant, we assume that the vectors are f-sparse, that is, the vectors do not contain
more than f nonzero entries. In many classical algorithms, execution speed, and not
complexity, is improved by exploiting sparse data structures. Such structures are
common in applications such as text mining and recommendation engines. We also
assume that there is an upper bound $r_{\max}$ on the maximum value of any feature in the
data set. We further assume that we have two quantum oracles:

$$O|j\rangle|i\rangle|0\rangle := |j\rangle|i\rangle|x_{ji}\rangle, \qquad F|j\rangle|l\rangle := |j\rangle|f(j, l)\rangle, \tag{12.2}$$

where $f(j, l)$ gives the location of the lth nonzero entry in $x_j$.

Given these conditions, finding the maximum overlap, $\max_j |\langle x|x_j\rangle|^2$, within error
at most $\epsilon$ and with success probability at least $1 - \delta$ requires an expected number of
queries that is on the order of

$$O\left( \frac{\sqrt{N} \log(1/\delta) f^2 r_{\max}^4}{\epsilon} \right). \tag{12.3}$$

To achieve this bound, we use a swap test (Section 10.2) on the following states:

$$\frac{1}{\sqrt{f}} \sum_i |i\rangle \left( \sqrt{1 - \frac{r_{ji}^2}{r_{\max}^2}}\, e^{-\mathrm{i}\phi_{ji}} |0\rangle + \frac{x_{ji}}{r_{\max}} |1\rangle \right) |1\rangle, \qquad \frac{1}{\sqrt{f}} \sum_i |i\rangle |1\rangle \left( \sqrt{1 - \frac{r_{0i}^2}{r_{\max}^2}}\, e^{-\mathrm{i}\phi_{0i}} |0\rangle + \frac{x_{0i}}{r_{\max}} |1\rangle \right), \tag{12.4}$$

where $r_{ji}$ comes from the polar form of the number $x_{ji} = r_{ji} e^{\mathrm{i}\phi_{ji}}$, and the index 0 refers to the new vector x. These states are
prepared by six oracle calls and two single-qubit rotations (Wiebe et al., 2014).

We would like to determine the probability of obtaining 0. To get a quick estimate,
we do not perform a measurement in the swap test, but apply amplitude estimation
instead to achieve a scaling of $O(1/\epsilon)$ in estimating P(0) with $\epsilon$ accuracy. Let us
denote this estimate by $\hat{P}(0)$. Given a register of dimension R, the error of this
estimation is bounded by

$$|\hat{P}(0) - P(0)| \leq \frac{\pi}{R} + \frac{\pi^2}{R^2}. \tag{12.5}$$

This way, R must be at least $R \geq \lceil 4\pi(\pi + 1) f^2 r_{\max}^4 / \epsilon \rceil$ to achieve an error bound of
$\epsilon/2$. Once we have an estimate of P(0), then the overlap is given by

$$|\langle x|x_j\rangle|^2 = (2P(0) - 1) f^2 r_{\max}^4. \tag{12.6}$$

A similar result holds for the Euclidean distance.

An alternative view on nearest-neighbor clustering is to treat the centroid vectors as
template states, and deal with the problem as quantum template matching. Quantum
template matching resembles various forms of state tomography. In quantum state
discrimination, we know the prior distribution of a set of states $\{\sigma_i\}$, and the task is
to decide which state we are receiving. In quantum state estimation, we reconstruct
a given unknown state by estimating its parameters. In template matching, we have a
prior set of states, as in state discrimination, but the unknown state is also characterized
by parameters, as in state estimation (Sasaki and Carlini, 2002; Sasaki et al., 2001).

To estimate the distance between the unknown state and the target template state,
we use fidelity (Section 4.7), and we choose the template for which the fidelity is the
largest. The matching strategy is represented by a positive operator-valued measure,
$\{P_j\}$, and its performance is measured by its average score over all matches. The
scenario applies to both classical and quantum inputs. In the first setting, the template
states are derived from classical information. In the second setting, they generalize to
fully quantum template states, where only a finite number of copies of each template
state are available. Yet, a generic strategy for finding the optimal positive
operator-valued measure does not exist, and we can deal only with special cases in which the
template states have a specific structure.


12.2 Support Vector Machines with Grover's Search 


The simplest form of quantum support vector machines observes that if the parameters 
of the cost function are discretized, we can perform an exhaustive search in the cost 
space (Anguita et al., 2003). The search for the minimum is based on a variant of 
Grover’s search (Dürr and Høyer, 1996).

The idea’s simplicity is attractive: there is no restriction on the objective function. 
Since we do not depend on an algorithm like gradient descent, the objective function 
might be nonconvex. The real strength of quantum support vector machines in this 
formulation might be this ability to deal with nonconvex objective functions, which 
leads to better generalization performance, especially if outliers are present in the 
training data. 

As pointed out in Section 9.3, a convex loss function applies a large penalty to 
points with large negative margin—these are points which are incorrectly classified 
and are far from the boundary. This is why outliers and classification noise affect 
a convex loss function. Carefully designed nonconvex loss functions may avoid this 
pitfall (Section 9.4). 

Since adiabatic quantum optimization is ideal for such nonconvex optimization
(Denchev et al., 2012), it would be worth casting support vector machines as
a quadratic unconstrained binary optimization program (Section 14.2), suitable for an
adiabatic quantum computer.

This formulation leaves the calculation of the kernel matrix untouched, which has $O(N^2 d)$
complexity. Although the search process has a reduced complexity, calculations will
be dominated by generating the kernel matrix. 
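To make the idea of searching a discretized cost space concrete, the following sketch performs the exhaustive search classically. It is a stand-in for the quantum minimum search, not an implementation of it: a Grover-type search would examine the same candidate set with quadratically fewer cost evaluations. The capped hinge loss, the parameter grid, and all function names are assumptions made for illustration.

```python
import itertools
import numpy as np

def clipped_hinge_loss(margins, cap=2.0):
    """A simple nonconvex loss: the hinge loss capped at `cap`, so far outliers stop dominating."""
    return np.minimum(np.maximum(0.0, 1.0 - margins), cap)

def exhaustive_search(X, y, grid):
    """Evaluate the cost at every discretized (w, b) and return the minimum and its parameters."""
    best_cost, best_params = np.inf, None
    for params in itertools.product(grid, repeat=X.shape[1] + 1):
        w, b = np.array(params[:-1]), params[-1]
        cost = clipped_hinge_loss(y * (X @ w + b)).sum()
        if cost < best_cost:
            best_cost, best_params = cost, (w, b)
    return best_cost, best_params

if __name__ == "__main__":
    # Four well-separated points plus one mislabeled outlier far from the boundary.
    X = np.array([[2.0, 0.0], [3.0, 0.0], [-2.0, 0.0], [-3.0, 0.0], [-10.0, 0.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
    print(exhaustive_search(X, y, grid=[-1, 0, 1]))
```

Because no gradient is used, the loss can be any function of the margins; the cap limits the contribution of the outlier, which would otherwise pull a convex hinge loss toward a worse separator.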
12.3 Support Vector Machines with Exponential Speedup


If we calculate only the kernel matrix by quantum means, we have a complexity of 
$O(M^2(M + \epsilon^{-1} \log N))$. There is more to gain: exponential speedup in the number
of training examples is possible when using the least-squares formulation of support 
vector machines (Section 7.5). 

The algorithm hinges on three ideas: 


* Quantum matrix inversion is fast (Harrow et al., 2009).

* Simulation of sparse matrices is efficient (Berry et al., 2007).

* Nonsparse density matrices reveal the eigenstructure exponentially faster than in classical
algorithms (Lloyd et al., 2013b).


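To solve the linear formulation in Equation 7.28, we need to invert

$$F = \begin{pmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & K + \gamma^{-1} I \end{pmatrix}.$$ (12.7)

The matrix inversion algorithm needs to simulate the matrix exponential of $F$. We split
$F$ as $F = J + K_\gamma$, where

$$J = \begin{pmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & 0 \end{pmatrix}$$ (12.8)

is the adjacency matrix of a star graph, and

$$K_\gamma = \begin{pmatrix} 0 & 0 \\ 0 & K + \gamma^{-1} I \end{pmatrix}.$$ (12.9)

We normalize $F$ with its trace: $\hat{F} = F / \operatorname{tr}(F) = F / \operatorname{tr}(K_\gamma)$.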


By using the Lie product formula, we get the exponential as 
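$$e^{-i\hat{F}\Delta t} = e^{-iJ\Delta t/\operatorname{tr}(K_\gamma)} e^{-i\gamma^{-1} I \Delta t/\operatorname{tr}(K_\gamma)} e^{-iK\Delta t/\operatorname{tr}(K_\gamma)} + O(\Delta t^2).$$ (12.10)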


To obtain the exponentials, the sparse matrix $J$ and the constant multiple of the
identity matrix are easy to simulate (Berry et al., 2007).
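The second-order error of the product formula can be checked numerically on a small classical example. The sketch below pads the kernel matrix and the scaled identity to the size of $F$, exponentiates each Hermitian term through its eigendecomposition, and compares the product with the exact exponential. The helper names, the random data, and the choice $\gamma = 1$ are assumptions for illustration; nothing here simulates a quantum device.

```python
import numpy as np

def expm_hermitian(H, t):
    """exp(-i H t) for a Hermitian matrix H, computed from its eigendecomposition."""
    vals, vecs = np.linalg.eigh(H)
    return (vecs * np.exp(-1j * vals * t)) @ vecs.conj().T

def build_blocks(K, gamma):
    """Return J, the padded gamma^{-1} I, and the padded K, each of size (M+1) x (M+1)."""
    M = K.shape[0]
    J = np.zeros((M + 1, M + 1))
    J[0, 1:] = 1.0
    J[1:, 0] = 1.0
    I_pad = np.zeros((M + 1, M + 1))
    I_pad[1:, 1:] = np.eye(M) / gamma
    K_pad = np.zeros((M + 1, M + 1))
    K_pad[1:, 1:] = K
    return J, I_pad, K_pad

def trotter_error(K, gamma, dt):
    """Spectral-norm distance between the exact exponential and the three-factor product."""
    J, I_pad, K_pad = build_blocks(K, gamma)
    tr = np.trace(I_pad + K_pad)  # tr(F) = tr(K_gamma), because tr(J) = 0
    exact = expm_hermitian((J + I_pad + K_pad) / tr, dt)
    approx = (expm_hermitian(J / tr, dt)
              @ expm_hermitian(I_pad / tr, dt)
              @ expm_hermitian(K_pad / tr, dt))
    return np.linalg.norm(exact - approx, 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 3))
    K = X @ X.T
    # Halving the time step should reduce the error by roughly a factor of four.
    print(trotter_error(K, 1.0, 0.1) / trotter_error(K, 1.0, 0.05))
```

The error comes from the fact that $J$ does not commute with the other two terms; the padded identity and the padded kernel matrix commute with each other.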

On the other hand, the kernel matrix $K$ is not sparse. This is where quantum self-analysis
helps: given multiple copies of a density matrix $\rho$, it is possible to perform
$e^{-i\rho t}$; this resembles quantum principal component analysis (Section 10.3). With
the quantum calculation of the dot product and access to a QRAM (Sections 10.1
and 10.2), we obtain the normalized kernel matrix

$$\hat{K} = \frac{K}{\operatorname{tr}(K)} = \frac{1}{\operatorname{tr}(K)} \sum_{i,j=1}^{N} \langle x_i | x_j \rangle |i\rangle\langle j|.$$ (12.11)

$\hat{K}$ is a normalized Hermitian matrix, which makes it a prime candidate for quantum
self-analysis. The exponentiation is done in $O(\log N)$ steps.

We use Equation 12.10 to perform quantum phase estimation to get the eigenvalues 
and eigenvectors. If the desired accuracy is $\epsilon$, we need $n = O(\epsilon^{-3})$ copies of the state.
Then, we express the $y$ in Equation 7.28 in the eigenbasis, and invert the eigenvalues
to obtain the solution $|b, \alpha\rangle$ of the linear equation (Harrow et al., 2009).

With this inversion algorithm, the overall time complexity of training the support
vector parameters is $O(\log(Nd))$.

The kernel function is restricted. Linear and polynomial kernels are easy to 
implement; in fact, the inner product is evaluated directly in the embedding space. 
Radial basis function kernels are much harder to imagine in this framework, and these
are the most common kernels. The approximation methods for calculating the kernel matrix
apply (Section 10.2). 

O(log(MN)) states are required to perform classification. This is advantageous 
because it compresses the kernel exponentially. Yet, it is also a disadvantage because 
the trained model is not sparse. A pivotal point in the success of support vector 
machines is structural risk minimization: the learned model should not overfit the data, 
otherwise its generalization performance will be poor. The least-squares formulation 
and its quantum variant are not sparse: every data instance will become a support 
vector. None of the $\alpha_i$ values will be zero.
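For comparison, the classical counterpart of this procedure is a single linear solve. The sketch below assumes the usual least-squares support vector machine system, in which the matrix with a zero corner, a border of ones, and the block $K + \gamma^{-1} I$ is applied to the stacked vector of $b$ and $\alpha$ to give the stacked vector of $0$ and $y$, with a linear kernel. The function names and toy data are hypothetical.

```python
import numpy as np

def train_ls_svm(X, y, gamma=1.0):
    """Solve the (M+1) x (M+1) least-squares SVM system for the bias b and the weights alpha."""
    M = X.shape[0]
    K = X @ X.T  # linear kernel matrix
    F = np.zeros((M + 1, M + 1))
    F[0, 1:] = 1.0
    F[1:, 0] = 1.0
    F[1:, 1:] = K + np.eye(M) / gamma
    solution = np.linalg.solve(F, np.concatenate(([0.0], y)))
    return solution[0], solution[1:]

def predict(X_train, alpha, b, x_new):
    """Classify a new point with the sign of the kernel expansion."""
    return np.sign(alpha @ (X_train @ x_new) + b)

if __name__ == "__main__":
    X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    b, alpha = train_ls_svm(X, y)
    print(b, alpha)  # every alpha_i is nonzero: the model is not sparse
```

The solve costs time cubic in the number of training examples on a classical computer, and the printed weights illustrate the lack of sparsity: every training instance contributes to the decision function.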


12.4 Computational Complexity 


Equation 12.3 indicates that quantum nearest neighbors is nearly quadratically better 
than the classical analogue even with sparse data. This is remarkable as most quantum 
learning algorithms do not assume a sparse structure in the quantum states. 

Since there are no strict limits on the rate of convergence of the gradient descent in 
the optimization phase of support vector machines, it is difficult to establish how much 
faster we can get if we use Grover’s algorithm to perform this stage. The bottleneck 
in this simple formulation remains the $O(N^2 d)$ complexity of calculating the kernel
matrix. 

If we rely on quantum input and output states, and replace the calculation of the 
kernel matrix with the method outlined in Section 10.2, we can achieve an exponential 
speedup over classical algorithms. Using the least-squares formulation of support 
vector machines, the self-analysis of quantum states will solve the matrix inversion 
problem, and the overall complexity becomes O(log(Nd)). 
13 Quantum Process Tomography and Regression


Symmetry is essential to why some quantum algorithms are successful—quantum 
Fourier transformation and Shor’s algorithm achieve an exponential speedup by 
exploiting symmetries. A group can be represented by sets of unitary matrices with the 
usual multiplication rule—that is, sets of elements with a binary operation satisfying 
algebraic conditions. Groups are essential in describing symmetries, and since no 
such representation is possible using stochastic matrices on classical computers, 
we see why symmetries are so important in quantum computing. Quantum process 
tomography is a procedure to characterize the dynamics of a quantum system: 
symmetry and the representation theory of groups play an important role in it, which 
in turn leads to the efficient learning of unknown functions. 

The task of regression thus translates well to quantum process tomography. The 
dynamic process which we wish to learn is a series of unitary transformations, also 
called a channel. If we denote the unitary by $U$, the goal becomes to derive an estimate
$\hat{U}$ such that $\hat{U}(\rho_{\mathrm{in}}) = \rho_{\mathrm{out}}$. Then, as in the classical case, we wish to calculate $\hat{U}(\rho)$
for a new state $\rho$ that we have not encountered before.

In a classical setting, we define an objective function, and we seek an optimum 
subject to constraints and assumptions. The assumption in learning by quantum 
process tomography is that the channel is unitary and that the unitary transformation 
is drawn from a group—that is, it meets basic symmetry conditions. The objective 
function is replaced by the fidelity of quantum states. 

Apart from these similarities, the rest of the learning process does not resemble the 
classical variant. Unlike in the classical setting, learning a unitary requires a double 
maximization: we need an optimal measuring strategy that optimally approximates 
the unitary, and we need an optimal input state that best captures the information of 
the unitary (Acin et al., 2001). This resembles active learning, where we must identify 
data instances that improve the learned model (Section 2.3). 

The key steps are as follows: 


* Storage and parallel application of the unitary on a suitable input state that achieves optimum
storage (Section 13.5).

* A superposition of maximally entangled states is the optimal input state (Section 13.6).

* A measure-and-prepare strategy on the ancilla with an optimal positive operator-valued
measure (POVM) is best for applying the learned unitary on an arbitrary number of new
states (Section 13.7).
More theory from quantum mechanics and quantum information is necessary to
understand how an unknown transformation can be learned. The key concepts are
the Choi-Jamiołkowski duality (Section 13.1) that allows ancilla-assisted quantum
process tomography (Section 13.2), compact Lie groups (Section 13.3) and their 
Clebsch-Gordan decomposition that define which representations are of interest, and 
covariant POVMs that will provide the optimal measurement (Section 13.4). These 
concepts are briefly introduced in the following sections, before we discuss how 
learning is performed. 


13.1 Channel-State Duality 


Given the properties of quantum states and transformations, noticing the similarities is 
inevitable. There is, in fact, a correspondence between quantum channels and quantum 
states. The Choi-Jamiołkowski isomorphism establishes a connection between linear
maps from Hilbert space $\mathcal{H}_1$ to Hilbert space $\mathcal{H}_2$ and operators in the tensor product
space $\mathcal{H}_1 \otimes \mathcal{H}_2$.

A quantum channel is a completely positive, trace-preserving linear map 


$$\Phi : L(\mathcal{H}_1) \to L(\mathcal{H}_2).$$ (13.1)


Here $L(\mathcal{H})$ is the space of linear operators on $\mathcal{H}$. The map $\Phi$ takes a density matrix
acting on the system in the Hilbert space $\mathcal{H}_1$ to a density matrix acting on the system
in the Hilbert space $\mathcal{H}_2$.

Since density matrices are positive, $\Phi$ must preserve positivity, hence the requirement
for a positive map. Furthermore, if an ancilla of some finite dimension $n$ is
coupled to the system, then the induced map $I_n \otimes \Phi$, where $I_n$ is the identity map
on the ancilla, must be positive, that is, $I_n \otimes \Phi$ is positive for all $n$. Such maps are
called completely positive.

The last constraint, the requirement to preserve the trace, derives from the density 
matrices having trace 1. 

For example, the unitary time evolution of a system is a quantum channel. It maps 
states in the same Hilbert space, and it is trace-preserving because it is unitary. More 
generic quantum channels between two Hilbert spaces act as communication channels 
which transmit quantum information. In this regard, quantum channels generalize 
unitary transformations. 

We define a matrix for a completely positive, trace-preserving map $\Phi$ in the
following way:

$$\rho_\Phi = \sum_{i,j=1}^{n} |e_i\rangle\langle e_j| \otimes \Phi(|e_i\rangle\langle e_j|).$$ (13.2)

This is called the Choi matrix of $\Phi$. By Choi’s theorem on completely positive maps,
$\Phi$ is completely positive if and only if $\rho_\Phi$ is positive (Choi, 1975; Jamiołkowski,
1972). Up to normalization by $1/n$, the operator $\rho_\Phi$ is a density matrix, and therefore it is
the state dual to the quantum channel $\Phi$.
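A small numerical sketch can make Choi's criterion tangible. It builds the unnormalized matrix $\sum_{i,j} |e_i\rangle\langle e_j| \otimes \Phi(|e_i\rangle\langle e_j|)$ for two single-qubit maps and inspects the smallest eigenvalue. The amplitude-damping channel and the transposition map are examples chosen here for illustration, and the code assumes that input and output spaces have the same dimension.

```python
import numpy as np

def choi_matrix(channel, n):
    """Sum over i, j of |e_i><e_j| (x) channel(|e_i><e_j|), without normalization."""
    choi = np.zeros((n * n, n * n), dtype=complex)
    for i in range(n):
        for j in range(n):
            unit = np.zeros((n, n), dtype=complex)
            unit[i, j] = 1.0
            choi += np.kron(unit, channel(unit))
    return choi

def amplitude_damping(rho, gamma=0.3):
    """Example completely positive, trace-preserving map given by two Kraus operators."""
    K0 = np.array([[1.0, 0.0], [0.0, np.sqrt(1.0 - gamma)]])
    K1 = np.array([[0.0, np.sqrt(gamma)], [0.0, 0.0]])
    return K0 @ rho @ K0.conj().T + K1 @ rho @ K1.conj().T

def transpose_map(rho):
    """Positive but not completely positive map, used as a counterexample."""
    return rho.T

if __name__ == "__main__":
    for name, channel in [("amplitude damping", amplitude_damping), ("transpose", transpose_map)]:
        eigenvalues = np.linalg.eigvalsh(choi_matrix(channel, 2))
        print(name, eigenvalues.min())
```

For the amplitude-damping channel all eigenvalues are nonnegative and the trace equals $n = 2$, so dividing by $n$ gives a unit-trace state. For transposition the matrix is the swap operator, which has an eigenvalue of $-1$, so the map is not completely positive.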
The duality between channels and states is thus the linear bijection

$$\Phi \mapsto \rho_\Phi.$$ (13.3)

This map is known as the Choi-Jamiołkowski isomorphism. This duality is convenient,
as instead of studying linear maps, we can study the associated linear operators, which
underlines the analogy that these maps are the generalization of unitary operators.


13.2 Quantum Process Tomography 


Quantum state tomography is a process of reconstructing the state for a source of a 
quantum system by measurements on the system. In quantum process tomography, 
the states are known, but the quantum dynamics of the system is unknown—the goal 
is to characterize the process by probing it with quantum states (Chuang and Nielsen, 
1997). The dynamics of a quantum system is described by a completely positive linear
map:

$$\mathcal{E}(\rho) = \sum_i A_i \rho A_i^\dagger,$$ (13.4)

where $\sum_i A_i^\dagger A_i = I$ ensures that the map preserves the trace. This is called the Kraus
form.

Let $\{E_i\}$ be an orthogonal basis for $B(\mathcal{H})$, the space of bounded linear operators on
$\mathcal{H}$. The operators $A_i$ are expanded in this basis as $A_i = \sum_m a_{im} E_m$. Thus, we have

$$\mathcal{E}(\rho) = \sum_{m,n} \chi_{mn} E_m \rho E_n^\dagger,$$ (13.5)

where $\chi_{mn} = \sum_i a_{im} a_{in}^*$. The matrix $\chi$ completely characterizes the process $\mathcal{E}$ in this
basis.
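The passage from Kraus operators to the process matrix can be sketched for a single qubit. The example below assumes the Pauli matrices as the orthogonal operator basis, computes the expansion coefficients of each Kraus operator, and forms $\chi_{mn} = \sum_i a_{im} a_{in}^*$. The function names and the amplitude-damping test channel are choices made for illustration.

```python
import numpy as np

PAULIS = [
    np.eye(2, dtype=complex),
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]

def chi_matrix(kraus_ops, basis=PAULIS):
    """chi[m, n] = sum_i a[i, m] * conj(a[i, n]), where A_i = sum_m a[i, m] E_m."""
    norm = np.trace(basis[0].conj().T @ basis[0]).real  # tr(E_m^dagger E_m), equal for all m here
    a = np.array([[np.trace(E.conj().T @ A) / norm for E in basis] for A in kraus_ops])
    return a.T @ a.conj()

def apply_chi(chi, rho, basis=PAULIS):
    """Evaluate sum_{m,n} chi[m, n] E_m rho E_n^dagger."""
    size = len(basis)
    return sum(chi[m, n] * basis[m] @ rho @ basis[n].conj().T
               for m in range(size) for n in range(size))

if __name__ == "__main__":
    K0 = np.array([[1, 0], [0, np.sqrt(0.7)]], dtype=complex)
    K1 = np.array([[0, np.sqrt(0.3)], [0, 0]], dtype=complex)
    rho = np.array([[0.6, 0.2 + 0.1j], [0.2 - 0.1j, 0.4]])
    chi = chi_matrix([K0, K1])
    print(np.allclose(apply_chi(chi, rho), K0 @ rho @ K0.conj().T + K1 @ rho @ K1.conj().T))
```

The process matrix is Hermitian by construction, and both descriptions of the channel give the same output state.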

There are direct and indirect methods for characterization of a process, and they are 
optimal for different underlying systems (Mohseni et al., 2008). Indirect methods rely 
on probe states and use quantum state tomography on the output states. With direct 
methods, experimental outcomes directly reveal the underlying dynamic process; 
there is no need for quantum state tomography. 

Among indirect approaches, ancilla-assisted quantum process tomography is of
special interest (Altepeter et al., 2003; D'Ariano and Lo Presti, 2003; Leung, 2003). It
uses the Choi-Jamiołkowski isomorphism (Section 13.1) to perform state tomography
and reveal the dynamic process. Consider the correspondence

$$\rho_{\mathcal{E}} = (\mathcal{E} \otimes I)(|\Phi^+\rangle\langle\Phi^+|),$$ (13.6)

where $|\Phi^+\rangle = \sum_{i=1}^{d} (1/\sqrt{d}) |i\rangle \otimes |i\rangle$ is the maximally entangled state of the system
under study and an ancilla of the same size. The information about the dynamics is
imprinted on the final state: our goal reduces to characterizing $\rho_{\mathcal{E}}$.

Assume that the initial state of the system with the ancilla is $\rho_{AB} = \sum_{i,j} \rho_{ij} E_i^A \otimes E_j^B$,
where $\{E_i^A\}$ and $\{E_j^B\}$ are the operator bases in the respective spaces of linear
operators of $\mathcal{H}_A$ and $\mathcal{H}_B$. The output state after applying $\mathcal{E}$ is

$$\rho'_{AB} = (\mathcal{E}_A \otimes I_B)(\rho_{AB}) = \sum_{i,j,m,n} \rho_{ij} \chi_{mn} E_m^A E_i^A E_n^{A\dagger} \otimes E_j^B.$$ (13.7)

Substituting $\tilde{\alpha}_{kj} = \sum_{m,n,i} \chi_{mn} \rho_{ij} \alpha_k^{m,i,n}$, where $E_m^A E_i^A E_n^{A\dagger} = \sum_k \alpha_k^{m,i,n} E_k^A$, we get the
simple formula

$$\rho'_{AB} = \sum_{k,j} \tilde{\alpha}_{kj} E_k^A \otimes E_j^B.$$ (13.8)

Notice that the values of $\alpha_k^{m,i,n}$ depend only on the choice of the operator basis. The $\tilde{\alpha}_{kj}$
values can be obtained by joint measurements of the observables $E_k^{A\dagger} \otimes E_j^{B\dagger}$. They are
obtained as

$$\tilde{\alpha}_{kj} = \operatorname{tr}(\rho'_{AB} \, E_k^{A\dagger} \otimes E_j^{B\dagger}).$$ (13.9)

Through the values of $\tilde{\alpha}_{kj}$, we obtain $\chi_{mn}$, thus characterizing the quantum process.
While many classes of input states are viable, a maximally entangled input state
yields the lowest experimental error rate, as $\rho$ has to be inverted (Mohseni et al., 2008).


13.3 Groups, Compact Lie Groups, and the Unitary Group 


Compact groups are natural extensions of finite groups, and many properties of finite 
groups carry over. Of them, compact Lie groups are the best understood. We will need 
their properties later, so we introduce a series of definitions that are related to these 
structures. 

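A group $G$ is a finite or infinite set, together with an operation $\cdot$ such that

$$\begin{aligned}
g_1 \cdot g_2 \in G \quad & \forall g_1, g_2 \in G, \\
(g_1 \cdot g_2) \cdot g_3 = g_1 \cdot (g_2 \cdot g_3) \quad & \forall g_1, g_2, g_3 \in G, \\
\exists e \in G : \; e \cdot g = g \cdot e = g \quad & \forall g \in G, \\
\forall g \in G \; \exists h \in G : \quad & g \cdot h = h \cdot g = e.
\end{aligned}$$ (13.10)

We often omit the symbol $\cdot$ for the operation, and simply write $g_1 g_2$ to represent
the composition. The element $e$ is the unit element of the group. The element $h$ in this
context is the inverse of $g$, and it is often denoted as $g^{-1}$.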

A topological group is a group G where the underlying set is also a topological 
space such that the group’s operation and the group’s inverse function are continuous 
functions with respect to the topology. The topological space of the group defines 
sets of neighborhoods for each group element that satisfy a set of axioms relating the 
elements and the neighborhoods. The neighborhoods are defined as subsets of the set 
of G, called open sets, satisfying these conditions. 

A function $f$ between topological spaces is called continuous if for all $g \in G$ and
all neighborhoods $N$ of $f(g)$ there is a neighborhood $M$ of $g$ such that $f(M) \subseteq N$. A
special class of continuous functions is called homeomorphisms: these are continuous
bijections for which the inverse function is also continuous. Topological groups
introduce a sense of duality: we can perform the group operations on the elements
of the set, and we can talk about continuous functions due to the topology.

A topological manifold is a topological space that is further characterized by
having a structure which locally resembles real $n$-dimensional space. A topological
space $X$ is called locally Euclidean if there is a nonnegative integer $n$ such that every
point in $X$ has a neighborhood which is homeomorphic to the Euclidean space $\mathbb{R}^n$. A
topological manifold is a locally Euclidean Hausdorff space, where Hausdorff means that
distinct points have disjoint neighborhoods in the space.

To do calculus, we need a further definition. A differentiable manifold is a 
topological manifold equipped with an equivalence class of atlases whose transition 
maps are all differentiable. Here an atlas is a collection of charts, where each 
chart is a linear space where the usual rules of calculus apply. The differentiable 
transitions between the charts ensure there is a global structure. A smooth manifold 
is a differentiable manifold for which all the transition maps have derivatives of all 
orders—that is, they are smooth. 

A Lie group is a group whose underlying set is also a finite-dimensional smooth manifold,
and in which the group operations of multiplication and inversion are smooth
maps. Smoothness of the group multiplication means that it is a smooth mapping of
the product manifold $G \times G$ to $G$.

A compact group is a topological group whose topology is compact. The intuitive
view of compactness is that it generalizes closed and bounded subsets of
Euclidean spaces. Formally, a topological space $X$ is called compact if each of its
open covers has a finite subcover, that is, for every collection $\{U_\alpha\}_{\alpha \in A}$ of open
subsets of $X$ such that $X = \bigcup_{\alpha \in A} U_\alpha$ there is a finite subset $J$ of $A$ such that
$X = \bigcup_{i \in J} U_i$.

A compact Lie group has all the properties described so far, and it is a
well-understood structure. An intuitive, albeit somewhat rough way to think about compact
Lie groups is that they contain symmetries that form a bounded set.

To gain insight into these new definitions, we consider an example, the circle group,
denoted by $U(1)$, which is a one-dimensional compact Lie group. It is the unit circle
on the complex plane with complex multiplication as the group operation:


$$U(1) = \{z \in \mathbb{C} : |z| = 1\}.$$ (13.11)


The notation U(1) refers to the interpretation that this group can also be viewed as 
1 x 1 unitary matrices acting on the complex plane by rotation about the origin. 

Complex multiplication and inversion are continuous functions on this set; hence, 
it is a topological group. 

Furthermore, the circle is a one-dimensional topological manifold. As multiplication
and inversion are analytic maps on the circle, it is a smooth manifold. The unit
circle is a closed and bounded subset of the complex plane; hence, it is a compact group. The circle
group is indeed a compact Lie group.

If we generalize this example further, the unitary group $U(n)$ of degree $n$ is the
group of $n \times n$ unitary matrices, with the matrix multiplication as the group operation.
It is a finite-dimensional compact Lie group.
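The group axioms can be spot-checked numerically for $U(n)$. The sketch below draws a few unitary matrices from the QR decomposition of random complex matrices and verifies closure, associativity, the unit element, and inverses up to floating-point tolerance. It is only a sanity check on sampled elements, not a proof, and the helper names are hypothetical.

```python
import numpy as np

def random_unitary(n, rng):
    """A unitary matrix taken from the QR decomposition of a random complex matrix."""
    z = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    q, _ = np.linalg.qr(z)
    return q

def is_unitary(U, tol=1e-10):
    """Check U^dagger U = I within the tolerance."""
    return np.allclose(U.conj().T @ U, np.eye(U.shape[0]), atol=tol)

def check_group_axioms(n, seed=0):
    """Spot-check the four group axioms on three sampled elements of U(n)."""
    rng = np.random.default_rng(seed)
    A, B, C = (random_unitary(n, rng) for _ in range(3))
    identity = np.eye(n)
    return {
        "closure": is_unitary(A @ B),
        "associativity": np.allclose((A @ B) @ C, A @ (B @ C)),
        "unit element": np.allclose(identity @ A, A) and np.allclose(A @ identity, A),
        "inverse": np.allclose(A @ A.conj().T, identity) and np.allclose(A.conj().T @ A, identity),
    }

if __name__ == "__main__":
    print(check_group_axioms(3))
```

For $n = 1$ the same checks reduce to statements about complex numbers of modulus one, which recovers the circle group.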
13.4 Representation Theory


Representation theory represents elements of groups as linear transformations of 
vector spaces, as these latter structures are much easier to understand and work with. 

Let $G$ be a group with a unit element $e$, and let $\mathcal{H}$ be a Hilbert space. A unitary
representation of $G$ is a function $U : G \to B(\mathcal{H})$, $g \mapsto U_g$, where $\{U_g\}$ are unitary
operators such that

$$U_{g_1 g_2} = U_{g_1} U_{g_2} \quad \forall g_1, g_2 \in G, \qquad U_e = I.$$ (13.12)

Naturally, the unitaries themselves form a group; hence, if the map is a bijection, then
$\{U_g\}$ is isomorphic to $G$.
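A finite group gives the simplest concrete instance of these conditions. The sketch below represents the cyclic group $\mathbb{Z}_n$ by cyclic shift matrices and verifies the homomorphism property and the unit-element condition for all pairs of elements. The choice of group and the function names are assumptions made for illustration.

```python
import numpy as np

def shift_representation(g, n):
    """U_g for g in Z_n: the permutation matrix sending |k> to |k + g mod n>."""
    return np.roll(np.eye(n), g % n, axis=0)

def is_unitary_representation(n):
    """Check U_e = I, unitarity, and U_{g1 g2} = U_{g1} U_{g2} for all pairs in Z_n."""
    if not np.allclose(shift_representation(0, n), np.eye(n)):
        return False
    for g1 in range(n):
        U1 = shift_representation(g1, n)
        if not np.allclose(U1 @ U1.conj().T, np.eye(n)):
            return False
        for g2 in range(n):
            composed = shift_representation((g1 + g2) % n, n)
            if not np.allclose(composed, U1 @ shift_representation(g2, n)):
                return False
    return True

if __name__ == "__main__":
    print(is_unitary_representation(5))
```

Here the map from group elements to matrices is injective, so the set of shift matrices is isomorphic to $\mathbb{Z}_n$.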

Recall that the Schrödinger evolution of a density matrix is given as $U\rho U^\dagger$ (see
Equation 3.42). If the unitaries are from a representation of a group, we can study the
automorphisms as the action of the group on the corresponding Hilbert space:

$$A_g(\rho) = U_g \rho U_g^\dagger.$$ (13.13)

Following the properties of a unitary representation in Equation 13.12, we have

$$A_{g_1 g_2}(\rho) = A_{g_1}(A_{g_2}(\rho)), \qquad A_e(\rho) = \rho, \qquad A_{g^{-1}}(A_g(\rho)) = \rho.$$ (13.14)

As we are working with quantum channels, the linear operators may not be unitary,
but we restrict our attention to unitary channels.


Since the unitary representation is a group itself, this defines a projective representation,
which is a linear representation up to scalar transformations

$$U_{g_1} U_{g_2} = \omega(g_1, g_2) U_{g_1 g_2} \quad \forall g_1, g_2 \in G,$$ (13.15)

where $\omega(g_1, g_2)$ is a phase factor, that is, $|\omega(g_1, g_2)| = 1$. It is called a cocycle. We
further require that

$$\begin{aligned}
\omega(g_1, g_2 g_3)\,\omega(g_2, g_3) &= \omega(g_1, g_2)\,\omega(g_1 g_2, g_3) \quad \forall g_1, g_2, g_3 \in G, \\
\omega(e, g) &= \omega(g, e) = 1 \quad \forall g \in G.
\end{aligned}$$ (13.16)


A representation of the unitary group is irreducible in an invariant subspace if the
subspace does not have a proper subspace that is invariant. Any unitary representation
$\{U_g\}$ of a compact Lie group can be decomposed into the direct sum of a discrete
number of irreducible representations. This decomposition is essential in studying the
properties of the representation.

More formally, let $\mathcal{H}$ be a Hilbert space and let $\{U_g \mid g \in G\}$ be a projective
representation. An invariant subspace $W \subseteq \mathcal{H}$ is a subspace such that $U_g W = W$
$\forall g \in G$. Then, the representation $\{U_g\}$ is irreducible in $W$ if there is no proper
invariant subspace $V$, $\{0\} \ne V \subset W$.
If $G$ is a compact Lie group, and $\{U_g \mid g \in G\}$ a projective representation of $G$ on the
Hilbert space $\mathcal{H}$, then $U_g$ can be decomposed into the direct sum of a discrete number
of irreducible representations.

Two projective irreducible representations $\{U_g\}$ and $\{V_g\}$ acting in the Hilbert
spaces $\mathcal{H}_1$ and $\mathcal{H}_2$, respectively, are called equivalent if there is an isomorphism
$T : \mathcal{H}_1 \to \mathcal{H}_2$ such that $T^\dagger T = I_{\mathcal{H}_1}$, $T T^\dagger = I_{\mathcal{H}_2}$, and $T U_g = V_g T$ $\forall g \in G$. In other words,
$V_g = T U_g T^\dagger$, which implies that the two representations have the same cocycle. The
isomorphism $T$ is called an intertwiner.

Let $\{U_g\}$ and $\{V_g\}$ be two equivalent projective irreducible representations. Then,
any operator $O : \mathcal{H}_1 \to \mathcal{H}_2$ such that $O U_g = V_g O$ $\forall g \in G$ has the form $O = \lambda T$,
where $\lambda \in \mathbb{C}$ is a constant, and $T : \mathcal{H}_1 \to \mathcal{H}_2$ is an intertwiner.

In matrix form, this means that if a matrix $O$ commutes with all matrices in $\{U_g\}$,
then $O$ is a scalar matrix. This is known as Schur's lemma.
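Schur's lemma can be observed numerically by averaging an arbitrary operator over an irreducible representation. The sketch below uses the two-dimensional irreducible representation of the dihedral group of the triangle, an example chosen here for illustration: the group average commutes with every representation matrix and therefore has to be a multiple of the identity. The function names are hypothetical.

```python
import numpy as np

def dihedral_irrep():
    """The six 2 x 2 matrices of the two-dimensional irreducible representation of D3."""
    c, s = np.cos(2 * np.pi / 3), np.sin(2 * np.pi / 3)
    rotation = np.array([[c, -s], [s, c]])
    reflection = np.diag([1.0, -1.0])
    rotations = [np.linalg.matrix_power(rotation, k) for k in range(3)]
    return rotations + [R @ reflection for R in rotations]

def twirl(O, group):
    """Group average (1/|G|) sum_g U_g O U_g^dagger, which commutes with every U_g."""
    return sum(U @ O @ U.conj().T for U in group) / len(group)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    O = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
    averaged = twirl(O, dihedral_irrep())
    print(np.allclose(averaged, np.trace(O) / 2 * np.eye(2)))
```

The scalar is fixed by the trace, since the group average preserves it: the result equals $\operatorname{tr}(O)/2$ times the identity.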

If the two projective irreducible representations $\{U_g\}$ and $\{V_g\}$ are inequivalent,
then the only operator $O : \mathcal{H}_1 \to \mathcal{H}_2$ such that $O U_g = V_g O$ $\forall g \in G$ is the null
operator $O = 0$.

We decompose a group representation $\{U_g\}$ acting in the Hilbert space $\mathcal{H}$ in a
discrete number of irreducible representations, each of them acting on a different
subspace:

$$U_g = \bigoplus_{\mu \in S} \bigoplus_{i=1}^{m_\mu} U_g^{\mu,i},$$ (13.17)

where $S$ is the set of equivalence classes of irreducible representations contained in the
decomposition, and $m_\mu$ is the multiplicity of equivalent irreducible representations in
the same equivalence class. The irreducible representations are naturally associated
with irreducible orthogonal subspaces of the Hilbert space $\mathcal{H}$:

$$\mathcal{H} = \bigoplus_{\mu \in S} \bigoplus_{i=1}^{m_\mu} \mathcal{H}_i^\mu.$$ (13.18)

With use of Schur's lemma, it is trivial to show that if $O \in B(\mathcal{H})$ is in the
commutant of $\{U_g\}$ ($[O, U_g] = 0$ $\forall g \in G$), then $O$ can be written as

$$O = \bigoplus_{\mu \in S} \sum_{i,j=1}^{m_\mu} \lambda_{ij}^\mu T_{ij}^\mu,$$ (13.19)

where $\lambda_{ij}^\mu \in \mathbb{C}$ are suitable constants.

We simplify the decomposition of the Hilbert space further by replacing a direct
sum with a tensor product of an appropriate complex space. Consider the invariant
subspaces $\mathcal{H}_i^\mu$ and $\mathcal{H}_j^\mu$, which correspond to equivalent irreducible representations
in the same equivalence class, and two orthonormal bases in these subspaces,
$B_i^\mu = \{|\mu, i, n\rangle \mid n = 1, \ldots, d_\mu\}$ and $B_j^\mu = \{|\mu, j, n\rangle \mid n = 1, \ldots, d_\mu\}$, such that

$$|\mu, i, n\rangle = T_{ij}^\mu |\mu, j, n\rangle \quad \forall n = 1, \ldots, d_\mu.$$ (13.20)

This view enables us to identify a basis vector with the following tensor product:

$$|\mu, i, n\rangle = |\mu, n\rangle \otimes |i\rangle.$$ (13.21)

Here $\{|\mu, n\rangle \mid n = 1, \ldots, d_\mu\}$ and $\{|i\rangle \mid i = 1, \ldots, m_\mu\}$ are orthonormal bases of $\mathcal{H}_\mu$
and $\mathbb{C}^{m_\mu}$, respectively. $\mathcal{H}_\mu$ is called the representation space, and $\mathbb{C}^{m_\mu}$ is the
multiplicity space. With this identification, we obtain the Clebsch-Gordan tensor
product structure, also called the Clebsch-Gordan decomposition:

$$\mathcal{H} = \bigoplus_{\mu \in S} \mathcal{H}_\mu \otimes \mathbb{C}^{m_\mu}.$$ (13.22)

The intertwiner in this notation reduces to the tensor product of an identity operator
and a projection between complex spaces:

$$T_{ij}^\mu = I_{d_\mu} \otimes |i\rangle\langle j|.$$ (13.23)

The matching unitary representation $\{U_g\}$ has a simple form:

$$U_g = \bigoplus_{\mu \in S} U_g^\mu \otimes I_{m_\mu}.$$ (13.24)
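A familiar instance of this block structure is the action of $U \otimes U$ on two qubits for $U$ in $SU(2)$, which splits into a three-dimensional symmetric (triplet) block and a one-dimensional antisymmetric (singlet) block, each with multiplicity one. The sketch below is an illustration chosen here, with hypothetical helper names; it changes to the triplet and singlet basis and checks that the off-diagonal blocks vanish.

```python
import numpy as np

S = 1.0 / np.sqrt(2.0)
# Columns: |00>, (|01> + |10>)/sqrt(2), |11> (triplet), then (|01> - |10>)/sqrt(2) (singlet).
BASIS = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, S, 0.0, S],
    [0.0, S, 0.0, -S],
    [0.0, 0.0, 1.0, 0.0],
])

def random_su2(rng):
    """A random SU(2) matrix built from a normalized pair of complex numbers."""
    a, b = rng.normal(size=2) + 1j * rng.normal(size=2)
    norm = np.sqrt(abs(a) ** 2 + abs(b) ** 2)
    a, b = a / norm, b / norm
    return np.array([[a, -np.conj(b)], [b, np.conj(a)]])

def block_form(U):
    """Express U (x) U in the triplet and singlet basis."""
    return BASIS.T @ np.kron(U, U) @ BASIS

if __name__ == "__main__":
    blocks = block_form(random_su2(np.random.default_rng(0)))
    print(np.round(np.abs(blocks), 3))
```

The singlet block equals the determinant of $U$, which is one for $SU(2)$, so the singlet is left invariant by every group element.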


Covariant POVMs play a major role in the optimal estimation of quantum states:
they address the optimal extraction of information from families of states that are
invariant under the action of a symmetry group. A POVM $P(d\omega)$ is covariant if and
only if, for any state $\rho \in S(\mathcal{H})$, the probability distribution $p(B|\rho) = \operatorname{tr}(P(B)\rho)$ is
group-invariant, that is,

$$p(B|\rho) = p(gB \mid A_g(\rho)) \quad \forall B \in \sigma(\Omega),$$ (13.25)

where $gB = \{g\omega \mid \omega \in B\}$ and $A_g(\rho) = U_g \rho U_g^\dagger$.

Let $G$ be a Lie group. For any fixed group element $h \in G$, the map $g \mapsto hg$ is a
diffeomorphism, and transforms the region $B \subseteq G$ into the region $hB = \{hg \mid g \in B\}$.
A measure $\mu_L(dg)$ on the Lie group $G$ is called left-invariant if $\mu_L(gB) = \mu_L(B)$,
for any group element $g \in G$ and for any region $B \subseteq G$. Right invariance is defined
similarly. While any Lie group has a left-invariant and a right-invariant measure, they
do not necessarily coincide. If they do, the group is called unimodular, and the measure
is called invariant.

Let $G$ be a locally compact unimodular group, and let $\{U_g\}$ be a projective
representation of $G$ in the Hilbert space $\mathcal{H}$. Let $H_0 \subseteq G$ be a compact subgroup. Then,
a POVM $P(d\omega)$ with outcome space $\Omega = G/H_0$ is covariant if and only if it has the
form

$$P(d\omega) = U_{g(\omega)} \Xi U_{g(\omega)}^\dagger \, \nu(d\omega),$$ (13.26)

where $g(\omega) \in G$ is any representative element of the equivalence class $\omega \in \Omega$, $\nu(d\omega)$
is the group-invariant measure over $\Omega$, and $\Xi$ is an operator satisfying the properties

$$\begin{aligned}
\Xi &\ge 0, \\
[\Xi, U_h] &= 0 \quad \forall h \in H_0, \\
\int_G U_g \Xi U_g^\dagger \, dg &= I,
\end{aligned}$$ (13.27)

where $dg$ is the invariant Haar measure over the group $G$.

This establishes a bijective correspondence between a covariant POVM with
outcome space $\Omega = G/H_0$ and a single operator $\Xi$, which is called the seed of the
POVM.
Given a Clebsch-Gordan tensor product structure for the elements of a representation
$\{U_g\}$ in the form of Equation 13.24, the normalization condition in the theorem
for the seed of the POVM can be written in the simple form

$$\operatorname{tr}_{\mathcal{H}_\mu}(\Xi) = d_\mu I_{m_\mu}.$$ (13.28)
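A finite-group analogue shows how a seed generates a covariant POVM. The sketch below replaces the Lie group by the cyclic group $\mathbb{Z}_d$ acting through diagonal phase matrices, replaces the Haar integral by a sum over group elements, and takes the trivial subgroup so that outcomes are labeled by group elements. These simplifications, the rank-one seed, and the function names are all assumptions made for illustration.

```python
import numpy as np

def phase_representation(g, d):
    """U_g = diag(exp(2 pi i g k / d)), a unitary representation of Z_d."""
    return np.diag(np.exp(2j * np.pi * g * np.arange(d) / d))

def covariant_povm(d):
    """POVM elements P_g = U_g Xi U_g^dagger generated by the seed Xi = |e><e|."""
    e = np.ones(d) / np.sqrt(d)  # uniform superposition, an assumed choice of seed vector
    seed = np.outer(e, e.conj())
    return [phase_representation(g, d) @ seed @ phase_representation(g, d).conj().T
            for g in range(d)]

def outcome_probabilities(povm, rho):
    """p(g | rho) = tr(P_g rho) for every outcome g."""
    return np.array([np.trace(P @ rho).real for P in povm])

if __name__ == "__main__":
    d, h = 4, 1
    povm = covariant_povm(d)
    rng = np.random.default_rng(0)
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = A @ A.conj().T
    rho /= np.trace(rho).real
    U_h = phase_representation(h, d)
    shifted = outcome_probabilities(povm, U_h @ rho @ U_h.conj().T)
    print(np.allclose(shifted, np.roll(outcome_probabilities(povm, rho), h)))
```

The elements sum to the identity, which is the discrete counterpart of the normalization condition on the seed, and transforming the state by $U_h$ merely shifts the outcome distribution by $h$.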


13.5 Parallel Application and Storage of the Unitary 


The task is simple: we have a black box that implements an unknown unitary U and 
we can make N calls to it to identify the unitary. This is fundamentally different from 
the classical setting in how the training instances are acquired. In regression, we have 
labeled examples $\{(x_1, y_1), \ldots, (x_N, y_N)\}$ to which we fit a function. In the quantum
setting, we assume that we are able to use the function generating the labels $N$ times.
While in the classical scenario there is little difference between using the pairs $(x_i, y_i)$
or $(x_i, f(x_i))$, where $f$ is the original function generating the process, in the quantum
setting, we must have access to U. 

Regression in a classical learning setting finds an estimator function in a family of 
functions characterized by some parameter $\theta$ (Sections 8.1 and 8.2). In the quantum
learning setting, the family of functions is the unitary representation $\{U_g\}$ of a
compact Lie group $G$, and the parameter we are estimating is the group element
$g \in G$. We omit the index of $U_g$ in the rest of the discussion. The generic process
of learning and applying the learned function to a single new output is outlined in 
Figure 13.1. 


Figure 13.1 Outline of quantum learning of a unitary transformation $U$. An optimal input
state $|\varphi\rangle$ is a maximally entangled state with an ancilla, consisting of a direct sum of identity
operators over irreducible spaces of $U^{\otimes N}$. This state is acted on by $N$ parallel calls of $U$, that
is, by $U^{\otimes N}$, while the action on the ancilla is trivial ($I^{\otimes N}$). To apply the learned unitary $\hat{U}$,
we perform an optimal POVM measurement strategy $P_{\hat{U}}$ on the output state, and apply the
estimated unitary on the new state $\rho$.
$U$ acts on a finite $d$-dimensional Hilbert space, and performs a deterministic
transformation belonging to a given unitary representation of a compact Lie
group (Chiribella et al., 2005). If we are guaranteed to have $N$ calls of $U$, two
immediate problems arise:


1. How do we store the approximated unitary? 
2. How do we arrange the N uses, in parallel or in sequence?


The first question is easier to address: the Choi-Jamiołkowski duality in Equation 13.3
enables us to treat the transformation of the quantum channel as a state.

To find out the answer to the second question, we define our objective: we 
maximize the fidelity of the output state with the target state, averaging over all 
pure input states and all unknown unitaries in the representation of the group 
with an invariant measure. Thus, the objective function, the channel fidelity is 
written as 


F= z fe {e[we wwe] av, (13.29) 


where we work in the computational basis. Here L is a positive operator describing 
the operation of the quantum channel, including the application of U to a new input 
state that is not among the training instances. This inclusion of the new input in the 
objective already hints at transduction. 

If we take the fidelity of output quantum states as the figure of merit, the optimal 
storage is achieved by a parallel application of the unitaries on an input state. Denote 
by $\mathcal{H}_i$ the Hilbert space of all inputs of the $N$ examples, and denote by $\mathcal{H}_o$ the Hilbert
space of all outputs. With $U^{\otimes N}$ acting on $\mathcal{H}_i$, we have the following (Bisio et al.,
2010): the optimal storage of $U$ can be achieved by applying $U^{\otimes N} \otimes I^{\otimes N}$ on a suitable
multipartite input state $|\varphi\rangle \in \mathcal{H}_o \otimes \mathcal{H}_i$.

The corollary of this lemma is that the training time is constant: it does not 
depend on the number of examples N. This is a remarkable departure from classical 
algorithms. The next question is what the suitable input state might be. 
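The two ingredients of this section, storing a unitary in a state and calling it in parallel, can be sketched with plain linear algebra. The example below applies $U \otimes I$ to a maximally entangled pair, which writes the matrix elements of $U$ into the state vector, and builds $U^{\otimes N}$ with repeated Kronecker products. The maximally entangled input is used only because it makes the encoding easy to read off; it is not claimed to be the optimal input state for $N > 1$, and the function names are hypothetical.

```python
from functools import reduce
import numpy as np

def tensor_power(U, N):
    """U applied in parallel N times: the N-fold Kronecker product of U with itself."""
    return reduce(np.kron, [U] * N)

def stored_state(U):
    """(U (x) I)|Phi+>, the state that stores U through channel-state duality."""
    d = U.shape[0]
    phi_plus = np.eye(d).reshape(d * d) / np.sqrt(d)  # sum_i |i>|i> / sqrt(d)
    return np.kron(U, np.eye(d)) @ phi_plus

def overlap_fidelity(U, V):
    """Squared overlap of the stored states, equal to |tr(U^dagger V)|^2 / d^2."""
    return abs(np.vdot(stored_state(U), stored_state(V))) ** 2

if __name__ == "__main__":
    hadamard = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
    recovered = stored_state(hadamard).reshape(2, 2) * np.sqrt(2.0)
    print(np.allclose(recovered, hadamard), overlap_fidelity(hadamard, np.eye(2)))
```

Reshaping the stored state returns the matrix of $U$ up to the factor $1/\sqrt{d}$, and two unitaries with $\operatorname{tr}(U^\dagger V) = 0$ are stored in orthogonal states.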


13.6 Optimal State for Learning 


Classical learning takes the training examples, and tries to make the best use of 
them. Certain algorithms even learn to ignore some or most of the training examples. 
For instance, support vector machines build sparse models using a fraction of the 
training data (Section 7.2). Some classical learning approaches may ask for specific 
extra examples, as, for instance, active learning does. Active learning is a variant 
of semisupervised learning in which the learning algorithm is able to solicit labels 
for problematic unlabeled instances from an appropriate information source—for 
instance, from a human annotator (Section 2.3). Similarly to the semisupervised 
setting, there are some labels available, but most of the examples are unlabeled. The 
task in a learning iteration is to choose the optimal set of unlabeled examples for which 
the algorithm solicits labels. 
Optimal quantum learning of unitaries resembles active learning in a sense: it
requires an optimal input state. Since the learner has access to U, by calling the 
transformation on an optimal input state, the learner ensures that the most important 
characteristics are imprinted on the approximation state of the unitary. 

The quantum learning process is active, but avoids induction. In inductive learning, 
based on a set of data points—labeled or unlabeled—we infer a function that will be 
applied to unseen data points. Transduction avoids this inference to the more general, 
and it infers from the particular to the particular (Section 2.3). This way, transduction 
asks for less: an inductive function implies a transductive one. That is, given our 
training set of points $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, we are trying to label just $M$ more points
$\{x_{N+1}, \ldots, x_{N+M}\}$. Inductive learning solves a more general problem of finding a rule
to label any future object. 

Transduction resembles instance-based learning, a family of algorithms that compare
new problem instances with training instances; K-means clustering and K-nearest
neighbors are two examples (Sections 5.3 and 7.1). If some labels are
available, transductive learning resembles semisupervised learning. Yet, transduction
is different: instance-based learning can be inductive, and semisupervised learning is
inductive, whereas transductive learning avoids inductive reasoning by definition. In
a way, the goal in transductive learning is actually to minimize the test error, instead
of the more abstract goal of maximizing generalization performance (El-Yaniv and
Pechyony, 2007).

The way transduction manifests itself in the optimal learning of a unitary transform 
is also via the input state. The input state is a superposition, and the probabilities of 
the superposition depend on M. 

A further requirement of the input state is that it has to be maximally entangled. 
To give an example of why entanglement is necessary, consider superdense coding: 
two classical bits are sent over a quantum channel by a single qubit. The two 
communicating parties, Alice and Bob, each have half of two entangled qubits—a 
Bell state. Alice applies one of four unitary transformations on her part, translating 
the Bell state into a different one, and sends the qubit over. Bob measures the state, 
and deduces which one of the four operations was used, thus retrieving two classical 
bits of information. The use of entanglement with an ancillary system improves the 
discrimination of unknown transformations; this motivates the use of such states in 
generic process tomography (Chiribella et al., 2005). 
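
The superdense coding protocol described above can be traced on a four-dimensional state vector. The following is a minimal sketch in Python, assuming real amplitudes and the common convention that Alice applies X for the second bit and Z for the first. The function names (`apply_to_first`, `cnot`, `superdense_roundtrip`) are illustrative and do not come from the cited works.

```python
import math

R = 1 / math.sqrt(2)

# Single-qubit gates as 2x2 real matrices.
X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]
H = [[R, R], [R, -R]]


def apply_to_first(gate, state):
    """Apply a 2x2 gate to the first qubit of a two-qubit state.

    The amplitude of |a b> is stored at index 2 * a + b.
    """
    out = [0.0] * 4
    for a in range(2):
        for b in range(2):
            for a_new in range(2):
                out[2 * a_new + b] += gate[a_new][a] * state[2 * a + b]
    return out


def cnot(state):
    """CNOT with the first qubit as control and the second as target."""
    return [state[0], state[1], state[3], state[2]]


def superdense_roundtrip(bit1, bit2):
    """Encode two classical bits on Alice's half of a Bell pair and decode them."""
    # Bell state (|00> + |11>) / sqrt(2); Alice holds the first qubit.
    state = [R, 0.0, 0.0, R]
    if bit2:
        state = apply_to_first(X, state)
    if bit1:
        state = apply_to_first(Z, state)
    # Bob now holds both qubits and performs a Bell measurement:
    # a CNOT followed by a Hadamard maps each Bell state to a basis state.
    state = apply_to_first(H, cnot(state))
    probabilities = [amp * amp for amp in state]
    outcome = max(range(4), key=probabilities.__getitem__)
    return outcome >> 1, outcome & 1
```

Each of the four operations maps the shared Bell state to a different, mutually orthogonal Bell state, which is why Bob can recover both bits from a single transmitted qubit.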

To bring active learning, transduction, and entanglement into a common framework,
we use the Clebsch-Gordan decomposition of $U^{\otimes N}$. The following lemma
identifies the optimal input state (Bisio et al., 2010). The optimal input state for storage
can be taken to be of the form

$$|\varphi\rangle = \bigoplus_{j \in \mathrm{Irr}(U^{\otimes N})} \sqrt{\frac{p_j}{d_j}}\, |I_j\rangle\rangle \in \tilde{\mathcal{H}}, \tag{13.30}$$

where $p_j$ are probabilities, the index $j$ runs over the set $\mathrm{Irr}(U^{\otimes N})$ of all irreducible
representations $\{U_j\}$ contained in the decomposition of $\{U^{\otimes N}\}$, $d_j$ is the dimension
of the corresponding subspace $\mathcal{H}_j$, and
$\tilde{\mathcal{H}} = \bigoplus_{j \in \mathrm{Irr}(U^{\otimes N})} (\mathcal{H}_j \otimes \mathcal{H}_j)$ is a subspace of
$\mathcal{H}_o \otimes \mathcal{H}_i$ carrying the representation
$\tilde{U} = \bigoplus_{j \in \mathrm{Irr}(U^{\otimes N})} (U_j \otimes I_j)$, $I_j$ being the identity
in $\mathcal{H}_j$.

The optimal state for the estimation of an unknown unitary is always a super- 
position of maximally entangled states, similarly to the case of quantum process 
tomography (Mohseni et al., 2008). Yet, this result is subtler: what matters here is 
the entanglement between a space in which the action of the group is irreducible and 
a space in which the action of the group is trivial (Chiribella et al., 2005). 

Requiring this form of input state is a special variant of active learning in which 
we are able to request the unitary transformation of this maximally entangled state in 
parallel. 

From the optimal input state in Equation 13.30, it is not immediately obvious
why learning the unitary is a form of transduction. The way the entangled state is
constructed is through irreducible representations, which in turn encode symmetries
and invariances of $U$. These characteristics depend primarily on how $U$ should act
on the $\rho_{\mathrm{in}}$ states to gain the target $\rho_{\mathrm{out}}$ states. The probabilities $p_j$, however, depend
on $M$, and thus on the test data, which indeed confirms that this learning scenario
is transduction. The retrieval strategy that applies $U$ performs measurements, and to
obtain the highest fidelity, $p_j$ must be tuned. We detail this in Section 13.7.

Choosing the optimal input state is necessary in similar learning strategies, such 
as a binary classification scheme based on the quantum state tomography of state 
classes through Helstrom measurements (Guţă and Kotłowski, 2010). In this case, the
POVM is the learned decision model, and it is probabilistic: the quantum system has 
an additional parameter for the identification of the training set that makes it easier to 
predict the output. Similarly, Bisio et al. (2011) require a maximally entangled state to 
learn measurements, and then perform transduction to a new state; this is the M = 1 
case. Generalizing this method to larger M values is difficult. 


13.7 Applying the Unitary and Finding the Parameter 
for the Input State 


The intuitive appeal of the optimal input state in Equation 13.30 is that states in this
form are easy to distinguish. Consider the inner product of two such states, $|\varphi_g\rangle$ and
$|\varphi_h\rangle$, for the $N = 1$ case:

$$\langle\varphi_g|\varphi_h\rangle = \sum_{j \in \mathrm{Irr}(U)} \frac{p_j}{d_j}\, \mathrm{tr}\left(U_j(g^{-1}h)\right). \tag{13.31}$$

If $p_j$ is chosen such that $p_j = \lambda d_j^2$, then this equation becomes an approximation of the
Dirac delta function in $G$: $\delta_{gh} = \sum_{j \in \mathrm{Irr}(U)} d_j\, \mathrm{tr}\left(U_j(g^{-1}h)\right)$ (Chiribella, 2011). Thus the
choice of the input state approximates optimal separation.

A measure-and-prepare strategy is optimal for applying the learned function an
arbitrary number of times. This strategy consists in measuring the state $|\varphi_U\rangle$ in the
quantum memory with an optimal POVM, and performing the unitary on the new
input state. Maximizing the fidelity with the measurement includes tuning $p_j$.

There is no loss of generality in restricting the search space to covariant
POVMs (Holevo, 1982), which corresponds to the estimation of unitaries that
are picked out from a representation at random. Covariant POVMs of the form
$M(g) = U_g \Xi U_g^\dagger$ are of interest (see Equation 13.27), with a positive operator $\Xi$
satisfying the normalizing condition $\int_G M(g)\,dg = I$ (Chiribella et al., 2005). The
following theorem provides the optimal measurement to retrieve the approximated
unitary (Bisio et al., 2010):

The optimal retrieval of $U$ from the memory state $|\varphi_U\rangle$ is achieved by measuring
the ancilla with the optimal covariant POVM in the form $\Xi = |\eta\rangle\langle\eta|$ (see
Equation 13.27), namely,

$$P_{\hat{U}} = |\eta_{\hat{U}}\rangle\langle\eta_{\hat{U}}|, \tag{13.32}$$

given by

$$|\eta_{\hat{U}}\rangle = \bigoplus_j \sqrt{d_j}\, |\hat{U}_j\rangle\rangle, \tag{13.33}$$

and, conditionally on outcome $\hat{U}$, by performing the unitary $\hat{U}$ on the new input
system.

Considering the fidelity of learning the unitary in Equation 13.29 with this covariant
POVM, we can write the optimal probability coefficients as follows (Chiribella
et al., 2005):

$$p_j = \frac{d_j m_j}{\sum_{k} d_k m_k}, \tag{13.34}$$

where $m_j$ is the multiplicity of the corresponding space. This is true for the $M = 1$
case. The more generic case for arbitrary $M$ takes the global fidelity between
the estimate channel $\mathcal{C}_U$ and $U^{\otimes M}$; that is, the objective function averages over
$(|\mathrm{tr}(U^\dagger \mathcal{C}_U)|/d)^{2M}$. Hence, the exact values of $p_j$ always depend on $M$, tying the optimal
input state to the number of applications, making it a case of transduction.

Generally, fidelity scales as F « mee for instance, for qubit states. The measurement 
will be optimal in the fidelity of quantum states, and it is also optimal for the 
maximization of the single-copy fidelity of all local channels. The fidelity does 
not degrade with repeated applications. If we thus measure the unitary, we use an 
incoherent classical storage of the process—that is, we perform a procedure similar to 
ancilla-assisted quantum process tomography. This incoherent strategy detaches the 
unitary from the examples, and we store it in classical memory. While the optimal 
input state necessarily depends on M, the incoherent process induces a function 
that can be applied to any number of examples, which is characteristic to inductive 
learning. It is a remarkable crossover between the two approaches. 

While this is not optimal, it is interesting to consider what happens if we do not 
measure U, but store it in quantum memory, and retrieve it from there when needed. 
Naturally, with every application, the state degrades. Yet, this situation is closer to 
pure transductive learning. The stored unitary transform depends solely on the data 
instances. The coherent strategy does not actually learn the process, but rather keeps 
the imprint in superposition. 

From the perspective of machine learning, restricting the approximating function
to unitaries is a serious constraint. It implies that the function is a bijection, ruling 
classification out. This method is relevant in regression problems. A similar approach 
that also avoids using quantum memory applies to binary classification: the unitary 
evolution is irrelevant in this case, and the optimality of measurement ensures robust 
classification based on balanced sets of examples (Sentis et al., 2012). 


14 Boosting and Adiabatic
Quantum Computing


Simulated annealing is an optimization heuristic that seeks the global optimum of an 
objective function. The quantum flavor of this heuristic is known as quantum anneal- 
ing, which is directly implemented by adiabatic quantum computing (Section 14.1). 
The optimization process through quantum annealing solves an Ising model, which 
we interpret as a form of quadratic unconstrained binary optimization (Sections 14.2 
and 14.3). 

Boosting—that is, combining several weak classifiers to build a strong classifier— 
translates to a quadratic unconstrained binary optimization problem. In this form, 
boosting naturally fits adiabatic quantum computing, bypassing the circuit model of 
quantum computing (Section 14.4). 

The adiabatic paradigm offers reduced computational time and improved gen- 
eralization performance owing to the nonconvex objective functions favored under 
the model (Section 14.5). The adiabatic paradigm, however, suffers from certain 
challenges; the following are the most important: 


• Weights map to qubits, resulting in a limited bit depth for manipulating the weights of
optimization problems.

• Interactions in an Ising model are considered only among at most two qubits, restricting
optimization problems to maximum second-order elements.


Contrary to expectations, these challenges improve sparsity and thus affect general- 
ization performance (Section 14.6). 

Adiabatic quantum computing has a distinct advantage over competing quantum 
computing methods: it has already been experimentally demonstrated in multiple- 
qubit workloads. The hardware implementation is controversial for many reasons— 
for instance, entanglement between qubits, if it exists, belongs to a category that 
is easily simulated on classical systems. Therefore, it is good practice to keep 
the theory of adiabatic quantum computing and its current practice separate. The 
current implementations, nevertheless, impose further restrictions, which we must 
address: 


• Not all pairs of qubits are connected owing to engineering constraints: there is a limited
connectivity between qubits, albeit the connections are known in advance.

• Optimization happens at a temperature slightly higher than absolute zero; hence, there is
thermal noise and excited states may be observed.

We must adjust the objective function and the optimization process to accommodate
these constraints (Section 14.7).

Addressing all these constraints means multiple additional optimization steps using 
a classical computer: we must optimize the free parameter in quantum boosting, 
the free parameter in the bit width, and the mapping to actual qubit connectivity. If 
we discount the complexity of classical optimization steps, the size of the adiabatic 
gap controls the execution time, which may be quadratically faster than in classical 
boosting (Section 14.8). 


14.1 Quantum Annealing 


Global optimization problems are often NP-hard. In place of exhaustive searches, 
heuristic algorithms approximate the global optimum, aiming to find a satisfactory 
solution. A metaheuristic is a high-level framework that allows the generation of 
heuristic algorithms for specific optimization problems. Simulated annealing is a 
widely successful metaheuristic (Sörensen, 2013). 

As in the case of many other heuristic methods, simulated annealing borrows a 
metaphor from a physical process. In metallurgy, annealing is a process to create 
a defect-free crystalline solid. The process involves heating the material—possibly 
only locally—then cooling it in a controlled manner. Internal stress is relieved 
as the rate of diffusion of the atoms increases with temperature, and imperfectly 
placed atoms can attain their optimal location with the lowest energy in the crystal 
structure. 

Simulated annealing uses the metaphor of this thermodynamic process by mapping 
solutions of an optimization problem to atomic configurations. The random dislo- 
cations in annealing change the value of the solution. Simulated annealing avoids 
getting trapped in a local minimum by accepting a solution that does not lower the 
value with a certain probability. Accepting worse solutions was, in fact, the novelty of 
this metaheuristic. The probability is controlled by the “temperature”: the higher it is, 
the more random movements are allowed. The temperature decreases as the heuristic 
explores the search space. 

This technique is often used when the search space is discrete. Neighboring states 
in the search space are thus permutations of the discrete variables. Simulated annealing 
converges to the global optimum under mild conditions, but convergence can be 
slow (Laarhoven and Aarts, 1987). 
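
The acceptance rule and the cooling of the temperature can be illustrated on a small binary problem. The sketch below assumes single-bit flips as the neighborhood, a geometric cooling schedule, and a quadratic binary objective; these choices and the names `qubo_energy` and `simulated_annealing` are illustrative assumptions, not a prescription from the cited literature.

```python
import math
import random


def qubo_energy(Q, x):
    """Value of sum_ij Q[i][j] * x[i] * x[j] for a binary vector x."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def simulated_annealing(Q, steps=2000, t_start=2.0, t_end=0.01, seed=0):
    """Minimize a quadratic binary objective with single-bit-flip moves."""
    rng = random.Random(seed)
    n = len(Q)
    x = [0] * n
    energy = qubo_energy(Q, x)
    best_x, best_energy = x[:], energy
    for step in range(steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        t = t_start * (t_end / t_start) ** (step / (steps - 1))
        i = rng.randrange(n)
        x[i] ^= 1  # propose a neighboring state
        new_energy = qubo_energy(Q, x)
        delta = new_energy - energy
        # Always accept improvements; accept worse states with a
        # probability that shrinks as the temperature decreases.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            energy = new_energy
            if energy < best_energy:
                best_x, best_energy = x[:], energy
        else:
            x[i] ^= 1  # reject the move and restore the previous state
    return best_x, best_energy
```

The function returns the best state visited rather than the final one, a common practical choice, since the walk may leave a good state while the temperature is still high.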

Quantum annealing replaces the thermodynamic metaphor, and uses the tunneling
field strength in lieu of temperature to control acceptance probabilities
(Figure 14.1) (Finnila et al., 1994). Neighborhood selection in simulated annealing
does not change throughout the simulation, whereas tunneling field strength is 
related to kinetic energy, and consequently the neighborhood radius is decreased in 
subsequent iterations. Its further advantage is that the procedure maps to an actual 
physical process. Hence, it is possible to implement optimization directly on quantum 
hardware, bypassing the higher-level models of quantum computing. 

Figure 14.1 Comparison of thermal and quantum annealing: thermal annealing often has a
higher barrier than quantum tunneling.


Quantum annealing, however, is prone to getting trapped in metastable excited
states, and may spend a long time in these states, instead of reaching the ground-state 
energy. The quantum adiabatic theorem helps solve this problem. 


14.2 Quadratic Unconstrained Binary Optimization 


We consider an alternative formulation of boosting. Instead of an exponential loss 
function in Equation 9.4, we define a regularized quadratic problem to find the 
optimal weights. The generic problem of quadratic unconstrained binary optimization 
(QUBO) is defined as follows: 


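$$\operatorname*{argmin}_{x} \sum_{i,j=1}^{N} w_{ij} x_i x_j, \tag{14.1}$$

such that

$$x_i \in \{0, 1\}, \quad i = 1, \ldots, N. \tag{14.2}$$

Given a set of weak learners $\{h_s \mid h_s : \mathbb{R}^d \to \{-1, 1\}, s = 1, 2, \ldots, K\}$, boosting
translates to the following objective function (Neven et al., 2008):

$$\begin{aligned}
&\operatorname*{argmin}_{w} \left( \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{s=1}^{K} w_s h_s(x_i) - y_i \right)^2 + \lambda \|w\|_0 \right) \\
&= \operatorname*{argmin}_{w} \left( \frac{1}{N} \sum_{i=1}^{N} \left( \left( \sum_{s=1}^{K} w_s h_s(x_i) \right)^2 - 2 \sum_{s=1}^{K} w_s h_s(x_i) y_i + y_i^2 \right) + \lambda \|w\|_0 \right).
\end{aligned} \tag{14.3}$$

Since $y_i^2$ is just a constant offset, the optimization reduces to

$$\operatorname*{argmin}_{w} \left( \frac{1}{N} \sum_{s=1}^{K} \sum_{t=1}^{K} w_s w_t \left( \sum_{i=1}^{N} h_s(x_i) h_t(x_i) \right) - \frac{2}{N} \sum_{s=1}^{K} w_s \left( \sum_{i=1}^{N} h_s(x_i) y_i \right) + \lambda \|w\|_0 \right). \tag{14.4}$$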


If we discretize this equation (Section 14.6), boosting becomes a QUBO. 

Equation 14.4 offers insight into the inner workings of the optimization. It depends
on the correlations between $h_s$ and the corresponding $y_i$ outputs. Those weak learners
$h_s$ that have an output correlated with the labels $y_i$ cause the bias term to be lowered;
thus, the corresponding $w_s$ weights have a higher probability of being 1.

The quadratic terms, which express coupling, depend on the correlation between 
the weak classifiers. Strong correlations will force the corresponding weights to be 
lowered; hence, these weights have a higher probability of being 0. 
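
To make the roles of the bias and coupling terms concrete, the following sketch builds the matrix of the quadratic form for binary weights and minimizes it by exhaustive search. It assumes one-bit weights, so that the squared weight equals the weight and the cardinality penalty is a linear term on the diagonal. The names `boosting_qubo` and `brute_force_qubo` and the toy data are hypothetical.

```python
from itertools import product


def boosting_qubo(H, y, lam):
    """Build the QUBO matrix of the boosting objective for binary weights.

    H[s][i] is the output of weak learner s on example i, in {-1, +1};
    y[i] is the label of example i, in {-1, +1}.
    """
    K, N = len(H), len(y)
    Q = [[0.0] * K for _ in range(K)]
    for s in range(K):
        for t in range(K):
            # Coupling: correlation between weak learners s and t.
            Q[s][t] = sum(H[s][i] * H[t][i] for i in range(N)) / N
        # Bias: correlation with the labels lowers it; lam penalizes inclusion.
        Q[s][s] += lam - 2.0 * sum(H[s][i] * y[i] for i in range(N)) / N
    return Q


def brute_force_qubo(Q):
    """Return the binary vector minimizing sum_st Q[s][t] * w[s] * w[t]."""
    K = len(Q)

    def energy(w):
        return sum(Q[s][t] * w[s] * w[t] for s in range(K) for t in range(K))

    return min(product((0, 1), repeat=K), key=energy)
```

On a toy problem in which the first weak learner reproduces the labels, the second is its negation, and the third is constant, only the first learner is selected:

```python
y = [1, -1, 1, -1]
H = [[1, -1, 1, -1], [-1, 1, -1, 1], [1, 1, 1, 1]]
selected = brute_force_qubo(boosting_qubo(H, y, lam=0.1))
```

Exhaustive search is feasible only for a handful of weak learners; it stands in here for the annealing-based solvers discussed in the text.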

QUBO is an NP-hard problem, yet, since the search space of QUBO is discrete, 
it is well suited for simulated annealing (Katayama and Narihisa, 2001) and quantum 
annealing (Neven et al., 2008). Solutions on classical computers are simulated by tabu 
search, which is based on local searches of a neighborhood (Glover, 1989; Palubeckis, 
2004). 


14.3 Ising Model 


The Ising model is a model in statistical mechanics to study the magnetic dipole 
moments of atomic spins. The spins are arranged in a lattice, and only neighboring 
spins are assumed to interact. The optimal configuration is given by 


$$\operatorname*{argmin}_{s} \left( s^\top J s + h^\top s \right), \tag{14.5}$$

where $s_i \in \{-1, +1\}$; these variables are called spins. The $J$ operator describes the
interactions between the spins, and $h$ stands for the impact of an external magnetic
field. The Ising model transforms to a QUBO with a change of variables $s = 2w - 1$.
The Ising objective function is represented by the following Hamiltonian:

$$H_I = \sum_{i,j} J_{ij} \sigma_i^z \sigma_j^z + \sum_{i} h_i \sigma_i^z, \tag{14.6}$$

where $\sigma_i^z$ is the Pauli Z operator acting on qubit $i$.
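
The change of variables $s = 2w - 1$ can be carried out mechanically. The sketch below, with the hypothetical helper `ising_to_qubo`, expands the Ising objective in terms of binary variables and returns a QUBO matrix together with the constant offset that the substitution produces. It relies on $w_i^2 = w_i$ to fold linear terms into the diagonal.

```python
def ising_energy(J, h, s):
    """Ising objective s^T J s + h^T s for spins s_i in {-1, +1}."""
    n = len(s)
    return (sum(J[i][j] * s[i] * s[j] for i in range(n) for j in range(n))
            + sum(h[i] * s[i] for i in range(n)))


def qubo_energy(Q, w):
    """QUBO objective w^T Q w for binary w_i in {0, 1}."""
    n = len(w)
    return sum(Q[i][j] * w[i] * w[j] for i in range(n) for j in range(n))


def ising_to_qubo(J, h):
    """Substitute s = 2w - 1 and collect terms.

    Returns (Q, offset) such that
    ising_energy(J, h, 2w - 1) == qubo_energy(Q, w) + offset.
    """
    n = len(h)
    Q = [[4 * J[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        # Linear terms go on the diagonal because w_i * w_i == w_i.
        Q[i][i] += 2 * h[i] - 2 * sum(J[i][j] + J[j][i] for j in range(n))
    offset = sum(J[i][j] for i in range(n) for j in range(n)) - sum(h)
    return Q, offset
```

The offset does not affect the location of the minimum, so the two formulations share the same optimal configuration.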




The ground-state energy of this Hamiltonian will correspond to the optimum of the 
QUBO. Thus, we are seeking the following ground state: 


$$\operatorname*{argmin}_{s} \left( s^\top H_I s \right). \tag{14.7}$$

Take the base Hamiltonian of an adiabatic process as

$$H_B = \sum_{i} \frac{1 - \sigma_i^x}{2}. \tag{14.8}$$

Inserting $H_B$ and $H_I$ in Equation 3.56, we have an implementable formulation of the
Ising problem on an adiabatic quantum processor:

$$H(\lambda) = (1 - \lambda) H_B + \lambda H_I. \tag{14.9}$$
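
For a few qubits, the interpolated Hamiltonian can be written out as an explicit matrix. The sketch below assumes the transverse-field base Hamiltonian $H_B = \sum_i (1 - \sigma_i^x)/2$ and a linear interpolation $H(\lambda) = (1 - \lambda) H_B + \lambda H_I$, with the Ising Hamiltonian diagonal in the computational basis. The helper names are hypothetical, and the dense representation is for illustration only, since its size grows as $2^n$.

```python
def ising_energy(J, h, s):
    """Ising objective s^T J s + h^T s for spins s_i in {-1, +1}."""
    n = len(s)
    return (sum(J[i][j] * s[i] * s[j] for i in range(n) for j in range(n))
            + sum(h[i] * s[i] for i in range(n)))


def spins_from_index(index, n):
    """Map a basis-state index to spins: bit 0 -> +1, bit 1 -> -1."""
    return [1 - 2 * ((index >> i) & 1) for i in range(n)]


def interpolated_hamiltonian(J, h, lam):
    """Dense matrix of (1 - lam) * H_B + lam * H_I on n = len(h) qubits."""
    n = len(h)
    dim = 2 ** n
    H = [[0.0] * dim for _ in range(dim)]
    for a in range(dim):
        # Diagonal: identity part of H_B plus the Ising energy of state a.
        H[a][a] = (1 - lam) * n / 2 + lam * ising_energy(J, h, spins_from_index(a, n))
        for i in range(n):
            # sigma_x on qubit i connects states that differ in bit i.
            H[a][a ^ (1 << i)] += -(1 - lam) / 2
    return H
```

At $\lambda = 0$ every row sums to zero, so the uniform superposition is an eigenvector with eigenvalue zero, which is the ground state of this base Hamiltonian. At $\lambda = 1$ the matrix is diagonal and its smallest entry is the optimum of the Ising problem.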


14.4 QBoost 


The Ising model allows mapping the solution of a QUBO problem to an adiabatic 
process. With this mapping, we are able to define QBoost, a boosting algorithm that 
finds the optimum of a QUBO via adiabatic optimization. 

Denote the number of weak learners by K; this equals N if no other constraints 
apply (see Section 14.7 for modifications if N is too large to be handled by the 
hardware). Further, let W denote the set of weak learners in the final classifier, and 
let T denote its cardinality. Algorithm 7 describes the steps of QBoost. 

As in LPBoost (Section 9.3), the feature space in QBoost is the output of weak 
learners. Hence, this method is not constrained by sequentiality either, unlike in the 
case of adaptive boosting (AdaBoost). 


14.5 Nonconvexity 


Nonconvexity manifests itself in two forms. One is the regularization term, the 
cardinality of the set of weights $\{w_i\}$. The other source of nonconvexity is the loss
function, albeit it is usually taken as a quadratic function. 

The regularization term in classical boosting is either absent or a convex function. 
Convexity yields analytic evidence for global convergence, and it often ensures the 
feasibility of a polynomial time algorithm. Nonconvex objective functions, on the 
other hand, imply an exhaustive search in the space. Adiabatic quantum computing 
overcomes this problem by the approximating Hamiltonian: we are not constrained by 
a convex regularization term. 

In the basic formulation of QBoost, we used a quadratic loss function, but we may 
look into finding a nonconvex loss function that maps to the quantum hardware. If 
there is such a function, its global optimum will be approximated efficiently by an 
adiabatic process—as in the case of a quadratic loss function. 
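
ALGORITHM 7 QBoost
Require: Training and validation data, dictionary of weak classifiers
Ensure: Strong classifier
Initialize weight distribution $d$ over training samples as uniform
    distribution $\forall i: d(i) = 1/N$.
Set $T \leftarrow 0$ and $W \leftarrow \emptyset$
repeat
    Select the set $W'$ of $K - T$ weak classifiers $h_t$ from the
        dictionary that have the smallest weighted training error rates
    for $\lambda = \lambda_{\min}$ to $\lambda_{\max}$ do
        Find $w^*(\lambda) = \operatorname*{argmin}_w \left( \sum_{i=1}^{N} \left( \sum_{h_t \in W \cup W'} w_t h_t(x_i)/K - y_i \right)^2 + \lambda \|w\|_0 \right)$
        Set $T(\lambda) \leftarrow \|w^*(\lambda)\|_0$
        Construct $H(x; \lambda) \leftarrow \mathrm{sign}\left( \sum_{t=1}^{K} w_t^*(\lambda) h_t(x) \right)$
        Measure validation error $E_{\mathrm{val}}(\lambda)$ of $H(x; \lambda)$ on unweighted
            validation set
    end for
    $\lambda^* \leftarrow \operatorname*{argmin}_\lambda E_{\mathrm{val}}(\lambda)$
    Update $T \leftarrow T(\lambda^*)$ and $W \leftarrow \{h_t \mid w_t^*(\lambda^*) = 1\}$
    Update weights $d(i) \leftarrow d(i) \left( \sum_{h_t \in W} h_t(x_i)/T - y_i \right)^2$
    Normalize $d(i) \leftarrow d(i) / \sum_{i=1}^{N} d(i)$
until validation error $E_{\mathrm{val}}$ stops decreasing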


First, if we replace the quadratic loss function in Equation 14.3 by the optimal 
0-1 loss function, analytic approaches no longer apply in solving the optimization 
problem; in fact, it becomes NP-hard (Feldman et al., 2012). Convex approximation 
circumvents this problem, but label noise adversely affects convex loss functions 
(Section 9.3). 

To approximate the 0-1 loss function with a QUBO model, we are seeking a loss 
function that is a quadratic function. The simple quadratic loss in Equation 14.3 is a 
convex variant. To make this loss function robust to label noise, we modify it with a 
parameterization. We define q-loss as 


$$L_q(m) = \min\left( (1 - q)^2, (\max(0, 1 - m))^2 \right), \tag{14.10}$$

where $q \in (-\infty, 0]$ is a parameter (Denchev et al., 2012). This loss function is no
longer convex, but it does not map to the QUBO model. Instead of calculating q-loss
directly, we approximate it via a family of quadratic functions characterized by a
variational parameter $t \in \mathbb{R}$. The following is equivalent to Equation 14.10:

$$L_q(m) = \min_{t} \left( (m - t)^2 + (1 - q)^2 \frac{1 - \mathrm{sign}(t - 1)}{2} \right). \tag{14.11}$$


Since this equation includes finding a minimum, the objective function becomes
a joint optimization, as it has to find the optimum for both $w$ and $t$. The advantage
is that it directly maps to a QUBO; hence, it can be solved efficiently by adiabatic
quantum computing, albeit the actual formulation is tedious. The following formula
assumes a Euclidean regularization term on linear decision stumps, with q-loss as the
loss function:


K 1 N K 2 N 
amn 5 WsWt È Yaw + bI] +b X ws É Fa 
"= pisl i=l s=l1 i=l 
K N N 2 
—2yjXis a{ 1 —2y; —(1—4q) 
E [eo] 
s=1 i=1 i=] 
K 
+o w, (14.12) 
s=l 


where tig, is the binary indicator bit of sign(t; — 1). The values in square brackets 
are the entries in the constraint matrix in the QUBO formulation. This formulation 
assumes continuous variables, so it needs to be discretized to map the problem to 
qubits. 

Finding the optimal $t$ reveals how q-loss handles label noise. In the second
formulation of q-loss (Equation 14.11), $t$ can change sign for a large negative margin,
thus flagging mislabeled training instances. For any value of $m$, the optimal $t$ of the
minimizer (Equation 14.11) belongs to one of three classes:


1. $m \geq 1 \Rightarrow t^*(m) = m$.
2. $q \leq m < 1 \Rightarrow t^*(m) = 1$.
3. $m < q \Rightarrow t^*(m) = m$.

The first case implies a zero penalty for large positive margins. The second case
reduces to the regular quadratic loss function. The third case can flip the label of a
potentially mislabeled example, while still incurring a penalty of $(1 - q)^2$ to keep in
order with the original labeling. This way, the parameter $q$ regulates the tolerance of
negative margins.

As $q \to -\infty$, the loss becomes more and more similar to the regular convex
quadratic loss function. As the other extreme, 0, is approached, the loss becomes more
similar to the 0-1 loss. The optimal value of q depends on the amount of label noise in 
the data, and therefore it must be cross-validated. 
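
The two formulations of q-loss can be compared numerically. The sketch below assumes the direct form $\min((1 - q)^2, \max(0, 1 - m)^2)$ and the variational form $\min_t \left[(m - t)^2 + (1 - q)^2 (1 - \mathrm{sign}(t - 1))/2\right]$, taking $\mathrm{sign}(0) = +1$. The function names are hypothetical.

```python
def q_loss(m, q):
    """Direct form: the quadratic hinge, capped at (1 - q)^2."""
    return min((1 - q) ** 2, max(0.0, 1 - m) ** 2)


def optimal_t(m, q):
    """Minimizer of the variational form, following the three margin cases."""
    if m >= 1:
        return m   # large positive margin: no penalty
    if m >= q:
        return 1   # regular quadratic loss
    return m       # large negative margin: pay the capped penalty instead


def variational_objective(m, q, t):
    """Objective inside the minimum of the variational form."""
    indicator = 0 if t >= 1 else 1  # equals (1 - sign(t - 1)) / 2
    return (m - t) ** 2 + (1 - q) ** 2 * indicator


def q_loss_variational(m, q):
    """Variational form evaluated at its minimizer."""
    return variational_objective(m, q, optimal_t(m, q))
```

With $q = -1$, a margin of $-3$ gives a loss of 4 rather than the 16 of the uncapped quadratic loss, which is how the cap limits the influence of a potentially mislabeled example.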


14.6 Sparsity, Bit Depth, and Generalization Performance 


Quantum boosting deviates from AdaBoost in two points. Instead of the exponential 
loss function in Equation 9.4, QBoost uses a quadratic loss. Moreover, it includes a 
regularization term to increase the sparsity of the strong learner. 


$$\operatorname*{argmin}_{w} \left( \sum_{i=1}^{N} \left( \sum_{k=1}^{K} w_k h_k(x_i) - y_i \right)^2 + \lambda \|w\|_0 \right). \tag{14.13}$$




The weights in this formulation are binary, and not real numbers as in AdaBoost. 
While weights could take an arbitrary value by increasing the bit depth, a few bits are, 
in fact, sufficient (Neven et al., 2008). 

The Vapnik-Chervonenkis dimension of a strong learner depends on the number
of weak learners included (Equation 9.3). Thus, through Vapnik’s theorem
(Equation 2.15), a sparser strong learner with the same training error will have a tighter
upper bound on the generalization error; hence, the regularization term in the
QUBO formulation is important.

The weights are binary, which makes it easy to derive an optimal value for the $\lambda$
regularization parameter, ensuring that weak classifiers are not excluded at the expense
of a higher training error.

We must transform the continuous weights $w_i \in [0, 1]$ to binary variables, which is
easily achieved by a binary expansion of the weights. How many bits do we need for
efficient optimization? As a binary variable is associated with a qubit, and current
hardware implementations have a very limited number of qubits, it is of crucial
importance to use a minimal number of bits to represent the weights. The structure
of the problem implies a surprisingly low number of necessary bits.

We turn to a geometric picture to gain insight into why this is the case. We map
the binary expansion of a weight to a hypercube of $2^{\mathrm{bits}}$ vertices, and with $K$ weak
learners, we have a total of $(2^{\mathrm{bits}})^K$ vertices.

For a correct solution of the strong classifier, we require that 


$$y_i \sum_{s=1}^{K} w_s h_s(x_i) > 0, \quad i = 1, \ldots, N. \tag{14.14}$$

This inequality forces us to choose weights on either side of a hyperplane in $\mathbb{R}^K$
for each training instance. The number of possible regions, $K_{\mathrm{regions}}$, that can be thus
generated is bounded by (Orlik and Terao, 1992)

$$K_{\mathrm{regions}} \leq 2 \sum_{k=0}^{K-1} \binom{N-1}{k}. \tag{14.15}$$


The number of vertices in the hypercube must be at least as large as the number of
possible regions. This can be relaxed by constructing the weak classifiers in a way such
that only the positive quadrant is interesting, which divides the number of possible
regions by about $2^K$. Hence, we require

$$\frac{(2^{\mathrm{bits}})^K}{K_{\mathrm{regions}} / 2^K} \geq 1. \tag{14.16}$$

Expanding this further, we get

$$\frac{(2^{\mathrm{bits}})^K}{K_{\mathrm{regions}} / 2^K} = \frac{(2^{\mathrm{bits}+1})^K}{K_{\mathrm{regions}}} \geq \frac{(2^{\mathrm{bits}+1})^K}{\sum_{k=0}^{K} \binom{N}{k}} \geq \frac{(2^{\mathrm{bits}+1})^K}{\left( \frac{eN}{K} \right)^K}. \tag{14.17}$$

We used $\sum_{k=0}^{K} \binom{N}{k} \leq \left( \frac{eN}{K} \right)^K$ if $N \geq K$. Requiring the last expression to be at
least 1, a simple rearrangement gives the lower bound on the number of bits:

$$\mathrm{bits} \geq \log_2 \frac{N}{K} + \log_2 e - 1. \tag{14.18}$$


This is a remarkable result, as it shows that necessary bit precision grows only 
logarithmically with the ratio of training examples to weak learners. A low bit 
precision is essentially an implicit regularization that ensures further sparsity. 
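
The logarithmic growth is easy to check numerically. The sketch below assumes the bound $\mathrm{bits} \geq \log_2(N/K) + \log_2 e - 1$ and the region count $2 \sum_{k=0}^{K-1} \binom{N-1}{k}$. The helper names `min_bits` and `region_bound` are hypothetical, and rounding up to at least one bit is a convenience added here.

```python
import math


def min_bits(n_examples, n_weak):
    """Smallest integer bit depth satisfying the logarithmic lower bound."""
    bound = math.log2(n_examples / n_weak) + math.log2(math.e) - 1
    return max(1, math.ceil(bound))


def region_bound(n_examples, n_weak):
    """Upper bound on the regions cut out by N hyperplanes through the origin in R^K."""
    return 2 * sum(math.comb(n_examples - 1, k) for k in range(n_weak))
```

For example, with ten times as many training examples as weak learners the bound evaluates to about 3.76, so four bits per weight satisfy it. With as many weak learners as examples, a single bit is enough.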

The objective function in Equation 14.4 needs slight adjustments. The continuous
weights mean arbitrary scaling, but when discretizing to low bit depth such scaling is
not feasible. To maintain desirable margins, we introduce a discrete scaling factor $\kappa$.
The optimal value of this parameter is unknown and it depends on the data; hence,
cross-validation is necessary. Let $\dot{w}_i$ denote discretized weights. Assuming a one-bit
discretization, the objective function becomes

$$\operatorname*{argmin}_{\dot{w}} \left( \sum_{i=1}^{K} \sum_{j=1}^{K} \dot{w}_i \dot{w}_j \left( \frac{\kappa^2}{N} \sum_{s=1}^{N} h_i(x_s) h_j(x_s) \right) + \sum_{i=1}^{K} \dot{w}_i \left( \lambda - \frac{2\kappa}{N} \sum_{s=1}^{N} h_i(x_s) y_s \right) \right). \tag{14.19}$$


This equation is indeed a valid QUBO. 
Multibit weights need further auxiliary bits w; for each discrete variable w; to 
express the regularization term. The cardinality regularization becomes 


K dw—1 
Aliw = 5 lo a Wir +1 - ie) a -apta (14.20) 


j=l t=1 


where $\phi > 0$ is a parameter, $d_w$ is the bit width, and $w_{jt}$ is the $t$th bit of the weight $w_j$.

Minimizing Equation 14.20 causes the auxiliary variables w; to act as indicator 
bits when the matching variable is nonzero. This representation assumes that the most 
significant bit is 1 and all others are 0 if wj. If the inner sum adds up to nonzero, the 
parameter $\phi$ will add a positive value on the first term, which forces the auxiliary bit of $w_j$
to switch to 1. This, in turn, is penalized by $\lambda$ times the auxiliary bit.

Experimental results using simulated annealing and tabu search on classical 
hardware verified this theoretical result, showing that the error rate in 30-fold cross- 
validation of 1-bit precision was up to 10% better than that using double-precision 
weights (Neven et al., 2008). Sparsity was up to 50% better. 


14.7 Mapping to Hardware 


The current attempts at developing adiabatic quantum computers are not universal 
quantum computers. They perform quantum annealing at finite temperature on an 
Ising model (Boixo et al., 2013; Johnson et al., 2011). 

Apart from being restricted to solving certain combinatorial optimization problems,
there are additional engineering constraints to consider when implementing learning
algorithms on this hardware. In manufacturing the hardware, not all connections are
possible; that is, not all pairs of qubits are connected. The connectivity is sparse,
but it is known in advance. The qubits are connected in an arrangement known as
a Chimera graph (McGeoch and Wang, 2013). This still puts limits on the search
space.

In a Chimera graph, groups of eight qubits are connected as complete bipartite graphs
($K_{4,4}$). In each of these groups, the four nodes on the left side are further connected to
their respective north and south neighbors in the grid. The four nodes on the right side 
are connected to their east and west neighbors (Figure 14.2). This way, internal nodes 
have a degree of six, whereas boundary nodes have a degree of five. 
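
The connectivity just described is straightforward to generate. The following sketch builds the edge set of a Chimera graph on a grid of eight-qubit cells; the node labeling `(row, column, side, index)` and the function names are assumptions made for illustration.

```python
def chimera_edges(rows, cols):
    """Edge set of a Chimera graph made of rows x cols eight-qubit cells.

    A node is (row, col, side, k): side 0 is the left half of a cell,
    side 1 the right half, and k in 0..3 indexes the qubit within a half.
    """
    edges = set()
    for r in range(rows):
        for c in range(cols):
            for k in range(4):
                # Complete bipartite K_{4,4} inside the cell.
                for l in range(4):
                    edges.add(((r, c, 0, k), (r, c, 1, l)))
                # Left-side nodes couple to the cell to the south
                # (and hence, seen from there, to the north).
                if r + 1 < rows:
                    edges.add(((r, c, 0, k), (r + 1, c, 0, k)))
                # Right-side nodes couple to the cell to the east
                # (and hence, seen from there, to the west).
                if c + 1 < cols:
                    edges.add(((r, c, 1, k), (r, c + 1, 1, k)))
    return edges


def degrees(edges):
    """Degree of every node that appears in the edge set."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return degree
```

On a 3 x 3 grid of cells this yields 72 qubits whose degrees are all five or six, in line with the description above.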

As part of the manufacturing process, some qubits will not be operational, or the
connection between a pair of qubits will not be functional, which further restricts graph
connectivity.

To minimize the information loss, we have to find an optimal mapping between
nonzero correlations in Equation 14.4 and the connections in the quantum
processor. We define a graph $G = (V, E)$ to represent the actual connectivity between
qubits; that is, a subgraph of the Chimera graph. We deal with the Ising
model equivalent of the QUBO defined in Equation 14.5, and we map those
variables to the qubit connectivity graph with a function $\phi: \{1, \ldots, n\} \to V$ such
that $(\phi(i), \phi(j)) \in E \Rightarrow J_{ij} \neq 0$, where $n$ is the number of optimization variables in
the QUBO.

We encode $\phi$ as a set of binary variables $\phi_{iq}$; these indicate whether an optimization
variable $i$ is mapped to a qubit $q$. Naturally, we require

$$\sum_{q} \phi_{iq} = 1 \tag{14.21}$$

for all optimization variables $i$, and also

$$\sum_{i} \phi_{iq} \leq 1 \tag{14.22}$$

for all qubits $q$.

Figure 14.2 An eight-node cluster in a Chimera graph.

Minimizing the information loss, we seek to maximize the magnitude of $J_{ij}$ mapped
to qubit edges; that is, we are seeking

$$\operatorname*{argmax}_{\phi} \sum_{i > j} \sum_{(q, q') \in E} |J_{ij}| \phi_{iq} \phi_{jq'}, \tag{14.23}$$

with the constraints applying to $\phi$ in Equations 14.21 and 14.22.

This problem itself is in fact NP-hard, being a variant of the quadratic assignment 
problem. It must be solved at each invocation of the quantum hardware; hence, a fast 
heuristic is necessary to approximate the optimum. The following algorithm finds an 
approximation in O(n) time complexity (Neven et al., 2009). 

Initially, let $i_1 = \operatorname*{argmax}_i \left( \sum_{j<i} |J_{ji}| + \sum_{j>i} |J_{ij}| \right)$; that is, $i_1$ is the row or column
index of $J$ with the highest sum of magnitudes. We assign $i_1$ to one of the qubit vertices
of the highest degree.

For the generic step, we already have a set $\{i_1, \ldots, i_k\}$ such that $\phi(i_j) = q_j$. To
assign the next $i_{k+1} \notin \{i_1, \ldots, i_k\}$ to an unmapped qubit $q_{k+1}$, we need to maximize
the sum of all $|J_{i_{k+1} i_j}|$ and $|J_{i_j i_{k+1}}|$ over all $j \in \{1, \ldots, k\}$, where $(q_j, q_{k+1}) \in E$.


This greedy heuristic reportedly performs well, mapping about 11% of the total
absolute edge weight $\sum_{i,j} |J_{ij}|$ of a fully connected random Ising model into actual
hardware connectivity in a few milliseconds, whereas a tabu heuristic on the same 
problem performs only marginally better, with a run time in the range of a few 
minutes (Neven et al., 2009). 

Sparse qubit connectivity is not the only problem with current quantum hardware
implementations. While the optimum is achieved in the ground state at absolute zero,
these systems run at nonzero temperature, at around 20-40 mK. This temperature is
significant at the energy scales of an Ising model, and thermally excited states are
observed in experiments. It also causes problems with respect to the minimum gap.
Mitigating this issue requires multiple runs on the same problem, after which the
result with the lowest energy is chosen. For a 128-qubit configuration, obtaining $m$
solutions to the same problem takes approximately $900 + 100m$ milliseconds, with
$m = 32$ giving good performance (Neven et al., 2009).
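
The selection of the lowest-energy result among repeated runs can be sketched as follows. The energy convention $E(s) = \sum_i h_i s_i + \sum_{i<j} J_{ij} s_i s_j$ is an assumption made for this example and may differ in sign from Equation 14.5; the function names are hypothetical. The timing helper simply evaluates the approximate figure quoted above.

```python
import numpy as np

def ising_energy(s, h, J):
    """Energy of a spin configuration s in {-1, +1}^n (assumed convention)."""
    return float(h @ s + s @ np.triu(J, k=1) @ s)

def best_of_runs(samples, h, J):
    """Return the sample with the lowest Ising energy and that energy."""
    energies = [ising_energy(s, h, J) for s in samples]
    k = int(np.argmin(energies))
    return samples[k], energies[k]

def wall_time_ms(m, setup_ms=900.0, per_run_ms=100.0):
    """Approximate time for m solutions: 900 + 100 m milliseconds."""
    return setup_ms + per_run_ms * m
```

With $m = 32$, the estimate evaluates to 4100 milliseconds, so the per-run readout cost dominates the fixed overhead.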

A further problem is that the number of candidate weak classifiers may exceed 
the number of variables that can be handled in a single optimization run on the 
hardware. We refer to such situations as large-scale training (Neven et al., 2012). It 
is also possible that the final selected weak classifiers exceed the number of available 
variables. 

An iterative, piecewise approach deals with these cases: at each iteration, a subset
of weak classifiers is selected via global optimization. Let $Q$ denote the number of
weak classifiers the hardware can accommodate at a time, let $T_{outer}$ denote the
total number of selected weak learners, and let $c(x)$ denote the current weighted
sum of weak learners. Algorithm 8 describes the extension of QBoost that can handle
problems of arbitrary size.


ALGORITHM 8 QBoost outer loop
Require: Training and validation data, dictionary of weak classifiers
Ensure: Strong classifier
  Initialize weight distribution $d_{outer}$ over training samples as
    uniform distribution $\forall s: d_{outer}(s) = 1/K$
  Set $T_{outer} \leftarrow 0$ and $c(x) \leftarrow 0$
  repeat
    Run Algorithm 7 with $d$ initialized from current $d_{outer}$ and using
      an objective function that takes into account the current $c(x)$:
      $w^* = \operatorname*{argmin}_w \left( \sum_s \left[ \frac{c(x_s) + \sum_{i=1}^{Q} w_i h_i(x_s)}{T_{outer} + Q} - y_s \right]^2 + \lambda \|w\|_0 \right)$
    Set $T_{outer} \leftarrow T_{outer} + \|w^*\|_0$ and $c(x) \leftarrow c(x) + \sum_{i=1}^{Q} w_i^* h_i(x)$
    Construct a strong classifier $H(x) = \operatorname{sign}(c(x))$
    Update weights $d_{outer}(s) = d_{outer}(s) \left( \sum_{t=1}^{T_{outer}} h_t(x_s)/T_{outer} - y_s \right)^2$
    Normalize $d_{outer}(s) = d_{outer}(s) / \sum_{s'} d_{outer}(s')$
  until validation error $E_{val}$ stops decreasing


QBoost thus considers a group of $Q$ weak classifiers at a time, where $Q$ is the
limit imposed by the constraints, and finds a subset with the lowest empirical risk
on this group. If the error reaches the optimum on the group, more weak classifiers
are necessary to decrease the error rate further. At this point, the algorithm changes
the working set of $Q$ classifiers, leaving earlier selected weak classifiers invariant.
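
A classical sketch of the outer loop may help to make the bookkeeping concrete. Several parts are assumptions rather than the method of Neven et al. (2012): the inner call to Algorithm 7 is replaced by picking the $Q$ unused weak classifiers with the lowest weighted training error and then minimizing the objective by brute force over binary weights, which is only feasible for small $Q$. The function name `qboost_outer` and the matrix layout `H[s, i] = h_i(x_s)` are hypothetical.

```python
import itertools
import numpy as np

def qboost_outer(H_train, y_train, H_val, y_val, Q=4, lam=0.1, max_iter=20):
    """Sketch of the QBoost outer loop with a brute-force inner optimizer.

    H_train[s, i], H_val[s, i] hold weak classifier outputs in {-1, +1};
    y_train, y_val hold labels in {-1, +1}.
    Returns the selected classifier indices and the best validation error.
    """
    S, K = H_train.shape
    d_outer = np.full(S, 1.0 / S)            # uniform weights over samples
    c_train, c_val = np.zeros(S), np.zeros(len(y_val))
    selected, T_outer, best_val = [], 0, np.inf
    for _ in range(max_iter):
        pool = [i for i in range(K) if i not in selected]
        if not pool:
            break
        # Assumed stand-in for Algorithm 7: working set of Q candidates with
        # the lowest weighted training error under d_outer.
        errs = [(d_outer * (H_train[:, i] != y_train)).sum() for i in pool]
        cand = [pool[j] for j in np.argsort(errs)[:Q]]
        # Brute force over binary weights replaces the quantum optimization.
        best_obj, best_w = np.inf, None
        for bits in itertools.product([0, 1], repeat=len(cand)):
            w = np.array(bits, dtype=float)
            pred = (c_train + H_train[:, cand] @ w) / (T_outer + Q)
            obj = ((pred - y_train) ** 2).sum() + lam * w.sum()
            if obj < best_obj:
                best_obj, best_w = obj, w
        if best_w.sum() == 0:                # nothing new was selected
            break
        new_c_train = c_train + H_train[:, cand] @ best_w
        new_c_val = c_val + H_val[:, cand] @ best_w
        val_err = np.mean(np.sign(new_c_val) != y_val)
        if val_err >= best_val:              # validation error stopped decreasing
            break
        best_val, c_train, c_val = val_err, new_c_train, new_c_val
        chosen = [i for i, b in zip(cand, best_w) if b]
        selected += chosen
        T_outer += len(chosen)
        # Reweight samples by the squared error of the averaged vote.
        d_outer = d_outer * (c_train / T_outer - y_train) ** 2
        if d_outer.sum() == 0:               # training set fitted exactly
            break
        d_outer /= d_outer.sum()
    return selected, best_val
```

Previously selected classifiers are excluded from later working sets, which mirrors the requirement that earlier selections remain invariant.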

Compared with the best known implementations on classical hardware, McGeoch
and Wang (2013) found that the actual computational time was shorter on adiabatic
quantum hardware for a QUBO, but the quantum hardware finished calculations in
approximately the same time on other optimization problems. This was a limited
experimental validation using specific data sets. Further research into computational
time showed that the optimal time for annealing was underestimated, and there was
no evidence of quantum speedup on an Ising model (Rønnow et al., 2014).

Another problem with the current implementation of adiabatic quantum computers
is that the evidence for quantum effects is inconclusive. There is evidence for
correlation between quantum annealing in an adiabatic quantum processor and
simulated quantum annealing (Boixo et al., 2014), and there are signs of entanglement
during annealing (Lanting et al., 2014). Yet, classical models for this quantum
processor are still not ruled out (Shin et al., 2014).

14.8 Computational Complexity


Time complexity derives from how long the adiabatic process must take to find the
global optimum with high probability. The quantum adiabatic theorem states that the
adiabatic evolution of the system depends on the time $t = t_1 - t_0$ during which the
change takes place. This time follows a power law:

$$t \propto g_{min}^{-\delta} \tag{14.24}$$

where $g_{min}$ is the minimum gap in the lowest-energy eigenstates of the system
Hamiltonian, and $\delta$ depends on the parameter $\lambda$ and the distribution of
eigenvalues at higher energy levels. For instance, $\delta$ may equal 1 (Schaller et al.,
2006), 2 (Farhi et al., 2000), or, in certain circumstances, even 3 (Lidar et al., 2009).
To understand the efficiency of adiabatic quantum computing, we need to analyze
$g_{min}$, but in practice, this is a difficult task (Amin and Choi, 2009).
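
As a rough worked illustration of this power law, read here as $t \propto g_{min}^{-\delta}$, and assuming the proportionality constant is the same for the two instances being compared, the ratio of run times for minimum gaps $g_{min}$ and $g'_{min}$ is

```latex
\frac{t'}{t} = \left( \frac{g_{min}}{g'_{min}} \right)^{\delta},
\qquad
g'_{min} = \frac{g_{min}}{10}
\;\Rightarrow\;
\frac{t'}{t} = 10^{\delta} =
\begin{cases}
10   & \delta = 1, \\
100  & \delta = 2, \\
1000 & \delta = 3.
\end{cases}
```

A tenfold smaller gap therefore costs between one and three orders of magnitude in run time, depending on which exponent applies.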

A few cases have analytic solutions, but in general, we have to resort to numerical 
methods such as exact diagonalization and quantum Monte Carlo methods. These are 
limited to small problem sizes and they offer little insight into why the gap is of a 
particular size (Young et al., 2010). 
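
For very small instances, exact diagonalization can be sketched in a few lines. The example below is an illustration under stated assumptions: it uses a linear interpolation $H(s) = (1-s) H_B + s H_P$, a base Hamiltonian of the form $H_B = \sum_i (1 - \sigma_i^x)/2$, and an Ising problem Hamiltonian $H_P = \sum_i h_i \sigma_i^z + \sum_{i<j} J_{ij} \sigma_i^z \sigma_j^z$; these forms and the function names are assumptions and may differ from the conventions of Equations 14.5 and 14.8. The matrices have dimension $2^n$, so the approach is limited to a handful of qubits.

```python
from functools import reduce
import numpy as np

I2 = np.eye(2)
SX = np.array([[0.0, 1.0], [1.0, 0.0]])   # Pauli X
SZ = np.diag([1.0, -1.0])                  # Pauli Z

def op_on(site_op, i, n):
    """Embed a single-qubit operator acting on qubit i into n qubits."""
    return reduce(np.kron, [site_op if k == i else I2 for k in range(n)])

def min_gap(h, J, steps=101):
    """Minimum gap between the two lowest eigenvalues of H(s) on a grid of s.

    Returns (gap, s) at the grid point where the gap is smallest.
    """
    n = len(h)
    H_B = sum(0.5 * (np.eye(2 ** n) - op_on(SX, i, n)) for i in range(n))
    H_P = sum(h[i] * op_on(SZ, i, n) for i in range(n))
    H_P = H_P + sum(J[i, j] * op_on(SZ, i, n) @ op_on(SZ, j, n)
                    for i in range(n) for j in range(i + 1, n))
    s_grid = np.linspace(0.0, 1.0, steps)
    gaps = []
    for s in s_grid:
        ev = np.linalg.eigvalsh((1.0 - s) * H_B + s * H_P)  # ascending order
        gaps.append(ev[1] - ev[0])
    k = int(np.argmin(gaps))
    return float(gaps[k]), float(s_grid[k])
```

If the problem Hamiltonian has a degenerate ground state, the reported gap closes at $s = 1$; the quantity is informative mainly for instances with a unique optimum.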

For the Ising model, the gap size scales linearly with the number of variables 
in the problem (Neven et al., 2012). Together with Equation 14.24, this implies 
a polynomial time complexity for finding the optimum of a QUBO. Yet, in other 
cases, the Hamiltonian is sensitive to perturbations, leading to exponential changes 
in the gap as the problem size increases (Amin and Choi, 2009). In some cases, we 
overcome such problems by randomly modifying the base Hamiltonian, and running 
the computation several times, always leading to the target Hamiltonian. For instance, 
we can modify the base Hamiltonian in Equation 14.8 by adding $n$ random variables $c_i$:


$$H_B = \sum_{i=1}^{n} c_i \frac{1 - \sigma_i^x}{2} \tag{14.25}$$


Since some Hamiltonians are sensitive to the initial conditions, this random
perturbation may avoid the small gap that causes long run times (Farhi et al., 2011).

Even if finding the global optimum takes exponential time, early exit might yield 
good results. Owing to quantum tunneling, the approximate solutions can still be better 
than those obtained by classical algorithms (Neven et al., 2012). 

It is an open question how the gapless formulation of the adiabatic theorem 
influences time complexity. 